This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: test_asyncio hanging for 1 hour (AIX version, hangs in test_subprocess_interactive)
Type: behavior Stage: resolved
Components: Tests Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gvanrossum Nosy List: David.Edelsohn, db3l, flox, gvanrossum, larry, ncoghlan, neologix, pitrou, python-dev, sbt, skrah, tim.peters, vstinner
Priority: normal Keywords: patch

Created on 2013-10-19 04:20 by David.Edelsohn, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
issue19293.patch David.Edelsohn, 2013-10-19 15:40 Skip some signal tests in test_events.py review
asyncio_signal.diff neologix, 2013-10-19 19:53 review
sockpair.diff gvanrossum, 2013-10-21 17:08
issue19293.patch2 David.Edelsohn, 2013-10-22 02:47 review
issue19293.patch3 David.Edelsohn, 2013-10-22 03:55 review
aixfix2.diff gvanrossum, 2013-10-22 04:27 review
Messages (72)
msg200364 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-19 04:20
test_asyncio times out after one hour on AIX and leaves a process consuming 100% of a thread.

[145/382/3] test_asyncio
Timeout (1:00:00)!
Thread 0x00000001:
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/selectors.py", line 265 in select
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/asyncio/base_events.py", line 576 in _run_once
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/asyncio/base_events.py", line 153 in run_forever
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/asyncio/base_events.py", line 172 in run_until_complete
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/test_asyncio/test_events.py", line 1012 in test_subprocess_interactive
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/case.py", line 571 in run
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/case.py", line 610 in __call__
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 117 in run
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 117 in run
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 117 in run
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/runner.py", line 168 in run
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/support/__init__.py", line 1661 in _run_suite
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/support/__init__.py", line 1695 in run_unittest
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/test_asyncio/__init__.py", line 26 in test_main
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 1276 in runtest_inner
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 965 in runtest
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 761 in main
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 1560 in main_in_temp_cwd
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/__main__.py", line 3 in <module>
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/runpy.py", line 73 in _run_code
  File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/runpy.py", line 160 in _run_module_as_main
make: 1254-004 The error code from the last command is 1.
msg200372 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-19 05:07
Is that traceback upside down compared to normal Python tracebacks?
msg200374 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-19 05:12
Assuming it is, it is hanging in the poll() syscall, here:
http://hg.python.org/cpython/file/f3a2dab1623b/Lib/selectors.py#265

Could you see if it works better with select()? Comment out lines 402, 403 in that file.
msg200426 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-19 15:40
Commenting out poll in selectors.py does not help.

AIX has its own epoll/kqueue efficient I/O event polling feature "pollset", which I will open as a separate enhancement request issue.

Some of the subprocess tests in test_asyncio/test_events.py related to signals are hanging on AIX: test_subprocess_interactive, test_subprocess_kill and test_subprocess_send_signal all hang. This probably is related to Issue #18238 and the tests skipped in test_signal.py.

The attached patch skips those tests on AIX.
msg200433 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-19 16:10
New changeset 41a841fac7fb by Guido van Rossum in branch 'default':
Disable some subprocess tests that hang on AIX.
http://hg.python.org/cpython/rev/41a841fac7fb
msg200435 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-19 16:18
Thanks for the investigation and the patch.

As to the deeper issue, I'm not sure I understand what's going on -- is this a bug in Python or in AIX?  Is there a better work-around than disabling the test?  The issue 18238 that you mention seems to be closed by disabling tests too, but that seems unsatisfactory.

Is the real issue that select() and poll() on AIX cannot be interrupted by signals?  That sounds really sad.
msg200437 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-19 16:25
Does AIX have an equivalent to Linux's strace?
If yes, having a sample output would be helpful.

Otherwise, maybe just adding manual debug traces to PollSelector.select() would help to see which events are reported.
msg200438 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013-10-19 16:26
I think the failure is not only on AIX. The FreeBSD bots had similar
failures, and here is one on PPC Linux:

http://buildbot.python.org/all/builders/PPC64%20PowerLinux%203.x/builds/881/steps/test/logs/stdio

Timeout (1:00:00)!
Thread 0x0000008074e46670:
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/selectors.py", line 313 in select
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/asyncio/base_events.py", line 576 in _run_once
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/asyncio/base_events.py", line 153 in run_forever
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/asyncio/base_events.py", line 172 in run_until_complete
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/test_asyncio/test_events.py", line 1078 in test_subprocess_close_after_finish
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/case.py", line 571 in run
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/case.py", line 610 in __call__
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 117 in run
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 117 in run
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 117 in run
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 79 in __call__
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/runner.py", line 168 in run
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/support/__init__.py", line 1661 in _run_suite
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/support/__init__.py", line 1695 in run_unittest
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/test_asyncio/__init__.py", line 26 in test_main
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 1276 in runtest_inner
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 965 in runtest
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 761 in main
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 1560 in main_in_temp_cwd
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/__main__.py", line 3 in <module>
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/runpy.py", line 73 in _run_code
  File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/runpy.py", line 160 in _run_module_as_main
make: *** [buildbottest] Error 1
msg200440 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-19 16:40
[Off-topic: Heck, another one of those upside-down tracebacks. WHY???]

I'll need a lot of help from folks on the affected platforms in debugging this, since these tests work fine when I run them on the platforms to which I have access (Ubuntu and OS X).

I would also like to point out http://code.google.com/p/tulip/issues/detail?id=67 (and the thread referenced in my last message there) as well as http://code.google.com/p/tulip/issues/detail?id=68.  Perhaps there are clues there.

I hope that we'll be able to sort this out before the beta release.
msg200449 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-19 17:36
My understanding of the issue is POSIX and The Open Group specification are ambiguous about the handling of Unix socket API calls when interrupted by a signal. Linux implements a more liberal, accommodating behavior and the Python Async I/O implementation is assuming those semantics. AIX inherited a lot from OSF, Mac OS X/Darwin inherited a lot from FreeBSD, both of which adopted more "unforgiving behavior".

The following web page seems to summarize the issue:

http://www.madore.org/~david/computers/connect-intr.html

I don't know if the description still is accurate for BSDs, but AIX likely implemented more pedantic, unforgiving semantics.

The standard seems to allow the behavior seen in AIX and FreeBSD, regardless of whether it is less useful / pleasant. From a pedantic standpoint, I think this is a bug in the Python implementation of Async I/O because it assumes behavior (implemented by Linux) not required by an implementation that conforms to the POSIX and Open Group specifications.
msg200452 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-19 17:47
Also, the Python buildbot for AIX is running on a POWER AIX 7.1 system at OSU OSL that is a member of the "GCC Compile Farm" cluster on which FOSS developers can obtain free accounts. If you want access to an AIX system for testing, it's available.
msg200456 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-19 18:03
Wait, I think there's a misunderstanding here.
POSIX does indeed leave undefined the behavior of syscalls when a
signal is received: whether or not it fails with EINTR depends on the
operating system, the SA_RESTART flag, etc.

But the tests we're talking about send signals whose default actions
are to terminate the process (e.g. SIGTERM, SIGHUP, etc): I cannot
beleive AIX doesn't terminate the processes in these situations, I
think it's a bug in asyncio which prevent the termination from being
detected (David, you can just use "ps" to check that the processes are
indeed terminated).
msg200461 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-19 18:09
Check that which processes are terminated? The test eventually timed out after 1 hour and the process running echo.py continued to run.
msg200463 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-19 18:11
Actually, I think we're talking about the actual select() syscal not
returning when SIGCHLD is received and handled. Running the C-level handler
isn't enough, the syscall must return with EINTR so the Python handler can
run (which then writes a byte to the self-pipe to ensure select() returns,
but that seems too late). Maybe twiddling with SA_RESTART can fix this?
msg200466 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-19 18:17
poll() vs select() does not make a difference. Using the default poll(), if I run python in GDB, test_process_interactive hangs in poll(). The top of the traceback looks like:

#0  0xd02389d4 in __fd_poll () from /usr/lib/libc.a(shr.o)
#1  0xd717e604 in poll (__listptr=0x2023e150, __nfds=1, __timeout=-1)
    at /usr/include/sys/poll.h:107
#2  0xd71800f8 in poll_poll (self=0x30696d10, args=0x304723b4)
    at /home/dje/src/cpython/Modules/selectmodule.c:543
msg200470 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-19 18:24
> Guido van Rossum added the comment:
>
> Actually, I think we're talking about the actual select() syscal not
> returning when SIGCHLD is received and handled. Running the C-level handler
> isn't enough, the syscall must return with EINTR so the Python handler can
> run (which then writes a byte to the self-pipe to ensure select() returns,
> but that seems too late). Maybe twiddling with SA_RESTART can fix this?

Hum, I just check asyncio's code, and it's not doing what I thought it
was doing.

SA_RESTART won't help: PyOS_setsig() already doesn't set it, and also
calls siginterrupt() to ensure the syscall will be interrupted, but as
can been seen here, that's not portable.

The correct way to do this, IMO, would be to simply change
BaseSelectorEventLoop._read_from_self() to read the signal number from
the wakeup FD (Python's signal handler writes the signal number to
this FD): then, we could call directly call the relevant signal
handler. This way, this would succeed even if select()/poll()/whatever
doesn't fail with EINTR (it'll also be a bit simpler/faster).
I thought that's how it was done in tulip!
msg200475 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-19 18:44
> [Off-topic: Heck, another one of those upside-down tracebacks. WHY???]

These are tracebacks emitted by the faulthandler module. They don't use the normal traceback printing machinery because it is impossible to do so in the conditions where those tracebacks are printed (e.g. from a signal handler).
msg200487 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-19 19:53
Here's a patch changing asyncio to not rely on a Python-level signal handler at all: it reads the signal number from the self-pipe.
It should therefore work even if the select()/poll()/etc syscall doesn't fail with EINTR, and incidentally require only one wakup per signal.
Also, it gets rid of BaseEventLoop._add_callback_signalsafe().

I just had to update a test which was timing-dependent (basically, before when a signal was received you had to add the handler to the ready list and issue a self-write to wakeup the main loop, now the handler is added to the ready list and the loop is automatically woken up).

This should fix the PPC Linux and FreeBSD failures.

I'm not so sure about AIX, since the error looks different.
msg200494 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-19 20:35
The patch cleans up a lot of the other failures for asyncio on AIX, but does not fix the hang.
msg200505 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-19 22:09
> The patch cleans up a lot of the other failures for asyncio on AIX

Great.

> but does not fix the hang.

I was expecting this :-)

Note that the hang is really strange, especially since the first
run_until_complete() succeeds.

Without more information, it'll be really hard to debug.

If you don't have an strace equivalent, then the only option is to add
ad-hoc debug statements to see:
- if the data gets written to the pipe
- if data is read from the pipe

There's something strange in the stack you posted above: there seems
to be a single FD in the poll() set, whereas there should be at least
3 (stdin, stdout and stderr).
msg200510 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-19 22:50
I guess it's a shame that the two issues got merged, since now it looks like they may have to be unmerged again. I sent a few comments to neologix' patch that should be easy to address; for AIX I'm afraid we're still looking for more info from David.
msg200513 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-19 23:13
The exact same hanging seems to happen under Ubuntu 8.04 LTS:

http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8803/steps/test/logs/stdio

[176/383] test_asyncio
Timeout (1:00:00)!
Thread 0x404e48c0:
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/selectors.py", line 313 in select
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/asyncio/base_events.py", line 576 in _run_once
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/asyncio/base_events.py", line 153 in run_forever
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/asyncio/base_events.py", line 172 in run_until_complete
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_asyncio/test_events.py", line 1081 in test_subprocess_close_after_finish
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 571 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 610 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 117 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 79 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 117 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 79 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 117 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 79 in __call__
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/runner.py", line 168 in run
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/__init__.py", line 1661 in _run_suite
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/__init__.py", line 1695 in run_unittest
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_asyncio/__init__.py", line 31 in test_main
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 1276 in runtest_inner
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 965 in runtest
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 761 in main
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 1560 in main_in_temp_cwd
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/__main__.py", line 3 in <module>
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 73 in _run_code
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 160 in _run_module_as_main
msg200514 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2013-10-19 23:14
There's also a test that hangs for 1 hour on the Ubuntu x86 buildbot.

Sample failure:
http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8803/
msg200516 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-19 23:26
If CF's patch fixes the Ubuntu hang, commit it!  We can iterate on my nits post-alpha.
msg200518 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-19 23:34
> Guido van Rossum added the comment:
>
> Actually, I think we're talking about the actual select() syscal not
> returning when SIGCHLD is received and handled. Running the C-level handler
> isn't enough, the syscall must return with EINTR so the Python handler can
> run (which then writes a byte to the self-pipe to ensure select() returns,
> but that seems too late).

I just realized the following: there's no need for EINTR to break out
of select(), that's what the self-pipe is for: since the write-end is
registered as wakeup FD, and the read-end is registered in the
selector, when the C-level signal handler is executed, it writes to
the wakeup FD, therefore the select() syscall should return with the
read-end ready, or more likely fail with EINTR. And then, the
Python-level signal handler should be executed (from the main loop),
scheduling a callback, and re-trigerring a wakup through the
self-pipe.

If this doesn't work as-is, I only see two explanations:
- the C-level signal handler doesn't get called, which seems impossible
- the python-level signal handler doesn't get executed before the main
loop re-enters select(), but since the new GIL signal handlers are
called at each eval loop

In short, I'm puzzled. I can tell why my patch should work, but I'm
not sure why the current approach doesn't.
msg200520 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-19 23:42
Hm... Maybe the wakeup_fd machinery is broken on some systems?

Has anybody been able to confirm that CF's patch fixes the hang on any non-AIX system? If so, maybe we should commit first, understand later...
msg200521 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-19 23:43
I don't have access to an Ubuntu 8.04 box, but if David Bolen can test the patch it would be great.
msg200525 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-19 23:55
Separate question for David: if CF's patch were committed, would we still need your patch to disable three subprocess-related tests on AIX? (That can be dealt with after the alpha release though -- in fact I propose not to wait for AIX if we can get the other platforms to stop hanging.)
msg200530 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-20 01:29
RHEL 6 too:
http://buildbot.python.org/all/builders/x86%20RHEL%206%203.x/builds/2911/steps/test/logs/stdio
msg200531 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-20 01:31
AIX has an equivalent to strace (called truss). I have recorded all AIX system calls and signals for test_process_interactive, which hangs, following all children created by fock. The uncompressed file is 82MB or 939KB compressed.  The highlights are:

test_subprocess_interactive (test.test_asyncio.test_events.PollEventLoopTests) ... 
4588380:        64422135: close(3)                              = 0
4588380:        64422135: statx("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", 0x2FF20870, 128, 010) = 0
4588380:        64422135: kopen("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", O_RDONLY|O_LARGEFILE) = 3
4588380:        64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
4588380:        64422135: kfcntl(3, F_SETFD, 0x00000001)        = 0
4588380:        64422135: fstatx(0x00000003, 0x2FF229E8, 0x00000080, 0x00000008) = 0x00000000
4588380:        64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
4588380:        64422135: lseek(0x00000003, 0x00000000, 0x00000001) = 0x00000000
4588380:        64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
4588380:        64422135: kread(3, " i m p o r t   o s\n\n i".., 4096) = 110
4588380:        64422135: klseek(3, 4294967295, -108, 0x00000001) = 0
4588380:        64422135: klseek(3, 0, 0, 0x00000000)           = 0
4588380:        64422135: kread(3, " i m p o r t   o s\n\n i".., 4096) = 110
4588380:        64422135: kread(3, " i m p o r t   o s\n\n i".., 4096) = 0
4588380:        64422135: close(3)                              = 0
4588380:        64422135: kread(0, " P y t h o n   � � � � �".., 1024) = 7
4588380:        64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000007) = 0x00000007
4588380:             0x00000000: " P y t h o n   � � � � �"..
59244954: 33161535: _poll(0x2067A970, 3, -1)            = 1
4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380:        64422135: kwrite(0x00000001, 0x200CD0F0, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..
4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
59244954: 33161535: __libc_sbrk(0x00000000)             = 0x20691220
4588380:        64422135: kwrite(0x00000001, 0x200CD140, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..
4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380:        64422135: kwrite(0x00000001, 0x200CD190, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..
59244954: 33161535: kread(9, " P y t h o n   � � � � �".., 262144) = 7
4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380:        64422135: kwrite(0x00000001, 0x200CD1E0, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..
59244954: 33161535: _poll(0x2067A970, 3, 0)             = 0
4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380:        64422135: kwrite(0x00000001, 0x200D66C0, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..
59244954: 33161535: _poll(0x2067A970, 3, 0)             = 0
4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380:        64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..
59244954: 33161535: _poll(0x2067A970, 3, 0)             = 04588380:     64422135: kread(0, " � � � � � � � � � � � �"..
, 1024)
 = 0
4588380:        64422135: kwrite(0x00000001, 0x200CD0F0, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..
4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
59244954: 33161535: _poll(0x2067A970, 3, 0)             = 0
4588380:        64422135: kwrite(0x00000001, 0x200CD140, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..
4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380:        64422135: kwrite(0x00000001, 0x200CD190, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..
4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380:        64422135: kwrite(0x00000001, 0x200CD1E0, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..
4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380:        64422135: kwrite(0x00000001, 0x200D66C0, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..
4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
4588380:        64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000000) = 0x00000000
4588380:             0x00000000: " � � � � � � � � � � � �"..

And it sits in this read/write loop with some intermittent _poll timeouts until I send SIGINT.
msg200532 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-20 01:35
Separate question for David: if CF's patch were committed, would we still need your patch to disable three subprocess-related tests on AIX? (That can be dealt with after the alpha release though -- in fact I propose not to wait for AIX if we can get the other platforms to stop hanging.)

CF's patch does not fix the hangs on AIX and my patch still is needed.  And I never was suggesting that correct functionality on AIX is a blocker.
msg200533 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-20 01:40
> test_subprocess_interactive (test.test_asyncio.test_events.PollEventLoopTests) ... 
> 4588380:        64422135: close(3)                              = 0
> 4588380:        64422135: statx("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", 0x2FF20870, 128, 010) = 0
> 4588380:        64422135: kopen("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", O_RDONLY|O_LARGEFILE) = 3
> 4588380:        64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
> 4588380:        64422135: kfcntl(3, F_SETFD, 0x00000001)        = 0
> 4588380:        64422135: fstatx(0x00000003, 0x2FF229E8, 0x00000080, 0x00000008) = 0x00000000
> 4588380:        64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
> 4588380:        64422135: lseek(0x00000003, 0x00000000, 0x00000001) = 0x00000000
> 4588380:        64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
> 4588380:        64422135: kread(3, " i m p o r t   o s\n\n i".., 4096) = 110
> 4588380:        64422135: klseek(3, 4294967295, -108, 0x00000001) = 0
> 4588380:        64422135: klseek(3, 0, 0, 0x00000000)           = 0
> 4588380:        64422135: kread(3, " i m p o r t   o s\n\n i".., 4096) = 110
> 4588380:        64422135: kread(3, " i m p o r t   o s\n\n i".., 4096) = 0
> 4588380:        64422135: close(3)                              = 0
> 4588380:        64422135: kread(0, " P y t h o n   � � � � �".., 1024) = 7
> 4588380:        64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000007) = 0x00000007
> 4588380:             0x00000000: " P y t h o n   � � � � �"..
> 59244954: 33161535: _poll(0x2067A970, 3, -1)            = 1
> 4588380:        64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0
> 4588380:        64422135: kwrite(0x00000001, 0x200CD0F0, 0x00000000) = 0x00000000
> 4588380:             0x00000000: " � � � � � � � � � � � �"..

It seems echo.py should at least exit on EOF, which would prevent the
hangs (right now it's stuck in a busy loop calling os.read(0) which
always returns immediately).

(same for echo3.py)
msg200534 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-20 02:45
I added

        if not buf: break

after
        buf = os.read()

to echo.py, echo2.py and echo3.py.  They now exit and do not spin, but the test still hangs.
msg200541 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-20 03:56
I wonder if subprocess pipes have a version of the race condition I fixed
for sockets yesterday:

http://code.google.com/p/tulip/source/detail?r=2d392599fa067038b5fc9eeaf051011307aeadee

The pipe code seems derived from the socket code long ago.
msg200543 - (view) Author: David Bolen (db3l) * Date: 2013-10-20 04:20
I've been trying to test this on the Ubuntu 8.04 buildbot, but so far, using the latest hg default head, haven't gotten test_asyncio to hang even before the patch.  It also looks like the last 4 regular buildbot builds were fine too.  I even tried rolling back to one of the revisions that did fail a buildbot build, but it went ok as well.

I'm guessing it's just a timing or race condition but without some way to reproduce reliably, I can't say much about the patch.
msg200544 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-20 04:30
Let's just release the alpha with this as it stands; I will look at the race condition later, and also at CF's patch (which may be a good idea regardless of what it fixes).

I did add the breaks to echo and echo3 since the busy-waiting just makes tracing this that much harder.
msg200547 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-20 06:37
I'm trying to let go of the AIX hang. Here's a brain dump of what I've figured out so far.

* There were a lot of red herrings in the early discussion. This hang doesn't seem to have anything to do with nonblocking connect() or sockets, nor even signals.

* Summary of what the test (test_subprocess_interactive) tries to do: it starts an echo subprocess, writes a string to it, reads the string back, writes another string to it, reads that back, and then closes the transport.

* The test hangs after seeing the first string echoed back but not the second, and in between somehow the stdin pipe is broken.

* If I read David's truss log correctly, the following things have happened:

- the parent wrote 'Python ' to the pipe for the subprocess's stdin (this is not shown in the extract but it must have happened because we see the string arrive in the subprocess)
- the echo.py subprocess started and began to read from stdin
- the subprocess read 'Python ' from its stdin
- the subprocess wrote 'Python ' back to its stdout
- poll() in the parent woke up
- the parent allocated some memory and read 'Python ' from the pipe for the subprocess's stdout

At this point apparently the pipe for the subprocess stdin got closed so the subprocess received an EOF (over and over due to the missing test+break).

We also know that the parent now hangs in the last run_until_complete() call, which means that it has at least attempted to write 'The Winner' -- but there is no evidence of this in the truss extract so it is possible that that string is still in the transport's write buffer. It is also possible that David simply missed it in the endless stream of ineffective calls due to the looping bug.

I'm actually curious why it seems that poll() keeps returning 0 in the parent -- shouldn't it have an infinite timeout, since there's nothing left to do?

Another theory is one or more *connection_lost() methods on the protocol are actually being called but the test stubbornly keeps waiting until proto.got_data[1] becomes set.

I'd be very interested in the truss output with the fix to echo.py in place (which is now in the repo).
msg200551 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-20 08:26
> David Edelsohn added the comment:
>
> AIX has an equivalent to strace (called truss). I have recorded all AIX system calls and signals for test_process_interactive, which hangs, following all children created by fock. The uncompressed file is 82MB or 939KB compressed.  The highlights are:

Apparently, the stdout pipe was closed by the parent process: it would
be nice to see where/how (it should be FD 7 according to strace on my
Linux box).
msg200552 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-20 08:42
> Apparently, the stdout pipe was closed by the parent process

Could it be that selecting for *read* on the *write* end of a pipe is
always ready? In _UnixWritePipeTransport there's a read handler that
immediately closes the pipe as soon as it called. I vaguely remember a
discussion on python-tulip that this might be Linux-specific behavior. (The
reason is that otherwise you can't find out whether the other end was
closed unless you attempt to write to the pipe.)

To test this theory, it should be sufficient to comment out line 280

self._loop.add_reader(self._fileno, self._read_ready)

from unix_events.py. This will make test_subprocess_kill() hang but it
should not affect test_subprocess_interactive. (So it is not a fix, just a
way to confirm the theory.)
msg200553 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-20 08:52
> Guido van Rossum added the comment:
>
>> Apparently, the stdout pipe was closed by the parent process
>
> Could it be that selecting for *read* on the *write* end of a pipe is
> always ready?

That's exactly what I was thinking when I read the code below: that's
definitely a possibility on AIX.

> In _UnixWritePipeTransport there's a read handler that
> immediately closes the pipe as soon as it called. I vaguely remember a
> discussion on python-tulip that this might be Linux-specific behavior. (The
> reason is that otherwise you can't find out whether the other end was
> closed unless you attempt to write to the pipe.)

Normally, if the pipe is closed, it should be reported by a POLLHUP/write ready.
Then you only consider it closed when write fails with EPIPE.
msg200582 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-20 13:44
> To test this theory, it should be sufficient to comment out
> self._loop.add_reader(self._fileno, self._read_ready)

When I comment out this line, test_subprocess_interactive succeeds on AIX.
msg200589 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-20 14:08
For completeness, the highlights of the new truss trace output after the echo.py change and only tracing the main process to avoid confusion from the interleaved output:

test_subprocess_interactive (test.test_asyncio.test_events.PollEventLoopTests) ...
socketpair(1, 1, 0, 0x2FF112F0)                 = 0
kfcntl(5, F_SETFD, 0x00000001)                  = 0
kfcntl(6, F_SETFD, 0x00000001)                  = 0
kfcntl(5, F_GETFL, 0x00000000)                  = 2
kioctl(5, -2147195266, 0x2FF12610, 0x00000000)  = 0
kioctl(5, -2147195267, 0x2FF12610, 0x00000000)  = 0
kfcntl(5, F_SETFL, 0x00000006)                  = 0
kfcntl(6, F_GETFL, 0x00000000)                  = 2
kioctl(6, -2147195266, 0x2FF12610, 0x00000000)  = 0
kioctl(6, -2147195267, 0x2FF12610, 0x00000000)  = 0
kfcntl(6, F_SETFL, 0x00000006)                  = 0
_poll(0x201A9B00, 1, 0)                         = 0
fstatx(6, 0x2FF0F0F0, 128, 010)                 = 0
sigprocmask(2, 0xF0296960, 0x2FF0F0A0)          = 0
_sigaction(20, 0x2FF0F0F8, 0x2FF0F108)          = 0
thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0x14260153, 0x00000003, 0x00000110, 0x00000000, 0x00000000) = 0x00000000
pipe(0x2FF0E2F0)                                = 0
kfcntl(7, F_SETFD, 0x00000001)                  = 0
kfcntl(8, F_SETFD, 0x00000001)                  = 0
pipe(0x2FF0E2F0)                                = 0
kfcntl(9, F_SETFD, 0x00000001)                  = 0
kfcntl(10, F_SETFD, 0x00000001)                 = 0
pipe(0x2FF0E2F0)                                = 0
kfcntl(11, F_SETFD, 0x00000001)                 = 0
kfcntl(12, F_SETFD, 0x00000001)                 = 0
fstatx(8, 0x2FF0E748, 128, 010)                 = 0
fstatx(8, 0x2FF0E748, 128, 010)                 = 0
kioctl(8, 22528, 0x00000000, 0x00000000)        Err#25 ENOTTY
fstatx(9, 0x2FF0E748, 128, 010)                 = 0
fstatx(9, 0x2FF0E748, 128, 010)                 = 0
kioctl(9, 22528, 0x00000000, 0x00000000)        Err#25 ENOTTY
fstatx(11, 0x2FF0E748, 128, 010)                = 0
fstatx(11, 0x2FF0E748, 128, 010)                = 0
kioctl(11, 22528, 0x00000000, 0x00000000)       Err#25 ENOTTY
pipe(0x2FF0E1C0)                                = 0
kfcntl(13, F_SETFD, 0x00000001)                 = 0
kfcntl(14, F_SETFD, 0x00000001)                 = 0
sigprocmask(0, 0xF0296960, 0xF0296968)          = 0
kfork()                                         = 51773770
thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0xD0528F00, 0x00000000, 0x14260153, 0x14260153, 0x00000000) = 0x00000000
_getpid()                                       = 65077750
close(14)                                       = 0
close(7)                                        = 0
close(10)                                       = 0
close(12)                                       = 0
kread(13, " � � � � � � � � � � � �".., 50000)  = 0
close(13)                                       = 0
fstatx(8, 0x2FF0DB58, 128, 010)                 = 0
kfcntl(8, F_GETFL, 0x00000000)                  = 1
kioctl(8, -2147195266, 0x2FF0CDD0, 0x00000000)  = 0
kioctl(8, -2147195267, 0x2FF0CDD0, 0x00000000)  = 0
kfcntl(8, F_SETFL, 0x00000005)                  = 0
_poll(0x204DBB90, 2, 0)                         = 0
_poll(0x204DBB90, 2, 0)                         = 0
kfcntl(9, F_GETFL, 0x00000000)                  = 0
kioctl(9, -2147195266, 0x2FF0C360, 0x00000000)  = 0
kioctl(9, -2147195267, 0x2FF0C360, 0x00000000)  = 0
kfcntl(9, F_SETFL, 0x00000004)                  = 0
_poll(0x204DBB90, 3, 0)                         = 0
_poll(0x204DBB90, 3, 0)                         = 0
kfcntl(11, F_GETFL, 0x00000000)                 = 0
kioctl(11, -2147195266, 0x2FF0C360, 0x00000000) = 0
kioctl(11, -2147195267, 0x2FF0C360, 0x00000000) = 0
kfcntl(11, F_SETFL, 0x00000004)                 = 0
_poll(0x2050B8C0, 4, 0)                         = 0
_poll(0x2050B8C0, 4, 0)                         = 0
_poll(0x2050B8C0, 4, 0)                         = 0
_poll(0x2050B8C0, 4, 0)                         = 0
kwrite(8, " P y t h o n  ", 7)                  = 7
_poll(0x2050B8C0, 4, 0)                         = 1
_poll(0x2050B8C0, 3, 0)                         = 0
close(8)                                        = 0
_poll(0x2050B8C0, 3, 0)                         = 0
_poll(0x2050B8C0, 3, -1)                        = 1
__libc_sbrk(0x00000000)                         = 0x20541130
kread(9, " P y t h o n   � � � � �".., 262144)  = 7
_poll(0x2050B8C0, 3, 0)                         = 0
_poll(0x2050B8C0, 3, 0)                         = 0
_poll(0x2050B8C0, 3, 0)                         = 0
_poll(0x2050B8C0, 3, 0)                         = 0
_poll(0x2050B8C0, 3, -1)                        = 2
kread(9, " � � � � � � � � � � � �".., 262144)  = 0
kread(11, " � � � � � � � � � � � �".., 262144) = 0
_poll(0x2050B8C0, 1, 0)                         = 0
close(9)                                        = 0
close(11)                                       = 0
    Received signal #20, SIGCHLD [caught]
_getpid()                                       = 65077750
kwrite(6, "14", 1)                              = 1
ksetcontext_sigreturn(0x2FF11D70, 0x200868EC, 0x00000008, 0xD0528CA4, 0x00000000, 0x14260153, 0x2FF22FF8, 0x00000000)
_poll(0x2050B8C0, 1, 0)                         = 1
_erecv(5, 0x3075C820, 1, 0, 0x00000000)         = 1
_poll(0x2050B8C0, 1, 0)                         = 0
kwaitpid(0x2FF12134, 0, 5, 0x00000000, 0x00000000) = 51773770
_poll(0x2050B8C0, 1, 0)                         = 0
_poll(0x2050B8C0, 1, -1)        (sleeping...)
msg200597 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-20 16:55
> > Could it be that selecting for *read* on the *write* end of a pipe is
> > always ready?
>
> That's exactly what I was thinking when I read the code below: that's
> definitely a possibility on AIX.

David confirmed that it is the _read_ready() that closes the pipe
prematurely. (Thanks David!)

> > In _UnixWritePipeTransport there's a read handler that
> > immediately closes the pipe as soon as it called. I vaguely remember a
> > discussion on python-tulip that this might be Linux-specific behavior. (The
> > reason is that otherwise you can't find out whether the other end was
> > closed unless you attempt to write to the pipe.)
>
> Normally, if the pipe is closed, it should be reported by a POLLHUP/write ready.
> Then you only consider it closed when write fails with EPIPE.

That's no good though unless the app wants to write -- the write end
also reports write ready when it is empty (or nearly so), so this
won't help an app that has done all the writing it wants to (at least
for a while) but still wants to be notified when the other end has
closed it. Now, there are other indications, such as the pipe in the
other direction giving an EOF, or the process exiting, but sometimes
neither of those happens, and then we're stuck.

I guess we'll have to write platform-dependent code and make this an
optional feature. (Essentially, on platforms like AIX, for a
write-pipe, connection_lost() won't be called unless you try to write
some more bytes to it.)

I do believe that so far this problem only occurs on AIX so I am
tempted to make it an explicit test for AIX -- if it's AIX, don't
register the _read_ready handler. We'll also have to skip or adjust
one or two tests that will fail if this feature is missing.
msg200598 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-20 17:36
> I guess we'll have to write platform-dependent code and make this an
> optional feature. (Essentially, on platforms like AIX, for a
> write-pipe, connection_lost() won't be called unless you try to write
> some more bytes to it.)
>
> I do believe that so far this problem only occurs on AIX so I am
> tempted to make it an explicit test for AIX -- if it's AIX, don't
> register the _read_ready handler. We'll also have to skip or adjust
> one or two tests that will fail if this feature is missing.

Hmm...
Not calling connection_lost() when the read-end of the pipe is closed
is OK, since there's no portable way to detect this on e.g. AIX, but
AFAICT the current process-termination logic is buggy:
When SIGCHLD is received, _sig_chld() is executed:

    def _sig_chld(self):
        [...]
        transp = self._subprocesses.get(pid)
        if transp is not None:
            transp._process_exited(returncode)

Then, here's _process_exited():

    def _process_exited(self, returncode):
        assert returncode is not None, returncode
        assert self._returncode is None, self._returncode
        self._returncode = returncode
        self._loop._subprocess_closed(self)
        self._call(self._protocol.process_exited)
        self._try_finish()

And here's _try_finish():

    def _try_finish(self):
        assert not self._finished
        if self._returncode is None:
            return
        if all(p is not None and p.disconnected
               for p in self._pipes.values()):
            self._finished = True
            self._loop.call_soon(self._call_connection_lost, None)

Thus, _UnixSubprocessTransport protocol's connection_lost is only
called if the all() expression is true:
and it's true only if all the subprocess pipes have been disconnected
(or if we didn't setup any pipe).

Unfortunately, this might very well never happen: imagine that the
subprocess forks a process: this grand-child process inherits the
child process's pipes, so when the child process exits, we won't
receive any notification, since this grand-child process still has
open FDs pointing to the original child's stdin/stdout/stderr.

The following simple test will hang until the background 'sleep' exits:
"""
diff -r 47618b00405b Lib/test/test_asyncio/test_events.py
--- a/Lib/test/test_asyncio/test_events.py      Sat Oct 19 10:45:48 2013 +0300
+++ b/Lib/test/test_asyncio/test_events.py      Sun Oct 20 19:32:37 2013 +0200
@@ -1059,6 +1059,23 @@

     @unittest.skipIf(sys.platform == 'win32',
                      "Don't support subprocess for Windows yet")
+    def test_subprocess_inherit_fds(self):
+        proto = None
+
+        @tasks.coroutine
+        def connect():
+            nonlocal proto
+            transp, proto = yield from self.loop.subprocess_shell(
+                functools.partial(MySubprocessProtocol, self.loop),
+                'sleep 60 &')
+            self.assertIsInstance(proto, MySubprocessProtocol)
+
+        self.loop.run_until_complete(connect())
+        self.loop.run_until_complete(proto.completed)
+        self.assertEqual(0, proto.returncode)
+
+    @unittest.skipIf(sys.platform == 'win32',
+                     "Don't support subprocess for Windows yet")
     def test_subprocess_close_after_finish(self):
         proto = None
         transp = None
"""

If waitpid() returns a process's PID, then the process is done,
there's no reason to further wait for pipe's disconnection: they can
be used as a hint that the process terminated, but there's definitely
not required...
msg200599 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-10-20 17:37
> I guess we'll have to write platform-dependent code and make this an
> optional feature. (Essentially, on platforms like AIX, for a
> write-pipe, connection_lost() won't be called unless you try to write
> some more bytes to it.)

If we are not capturing stdout/stderr then we could "leak" the write end of a pipe to the child.  When the read end becomes readable we can call the process protocol's connection_lost().

Or we could just call connection_lost() when reaping the pid.
msg200615 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-20 19:01
No, there's a use case for reading after the child exited, if there is a
grandchild still writing.

--Guido van Rossum (sent from Android phone)
On Oct 20, 2013 10:37 AM, "Richard Oudkerk" <report@bugs.python.org> wrote:

>
> Richard Oudkerk added the comment:
>
> > I guess we'll have to write platform-dependent code and make this an
> > optional feature. (Essentially, on platforms like AIX, for a
> > write-pipe, connection_lost() won't be called unless you try to write
> > some more bytes to it.)
>
> If we are not capturing stdout/stderr then we could "leak" the write end
> of a pipe to the child.  When the read end becomes readable we can call the
> process protocol's connection_lost().
>
> Or we could just call connection_lost() when reaping the pid.
>
> ----------
> nosy: +sbt
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue19293>
> _______________________________________
>
msg200666 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-10-21 00:19
[Off-topic: Heck, another one of those upside-down tracebacks. WHY???]

faulthandler is really a very low-level module. I wrote it to analyze deadlocks, memory corruptions and other sad cases. faulthandler is designed to provide as much information as possible even if Python internals are corrupted. So it only reads data structures, without processing them too much. For example, frames are stored from the most recent to the oldest. faulthandler also truncates long traceback to 100 frames (write "..." if there are more), to stop the loop if there is a cycle in the chained list of frames.

The documentation also mentions other limitations:
http://docs.python.org/dev/library/faulthandler.html
msg200671 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-10-21 00:28
> Hm... Maybe the wakeup_fd machinery is broken on some systems?

Hey! I added many tests when I added this feature to Python: see test_signal.py. By the way, it uses select.select() :-) Does test_signal pass successfully on AIX?
msg200672 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2013-10-21 00:33
Just wondering:  why is the timeout-to-failure set to a whole hour?  Based on all the messages I've seen this weekend, we could save the buildbots countless years of time by reducing it to - say - 59 minutes ;-)
msg200689 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-21 03:06
> I added many tests when I added this feature to Python: see test_signal.py. By the way, it uses select.select() :-) Does test_signal pass successfully on AIX?

@haypo: test_signal skips some tests on AIX.  See Issue #18238 .
msg200713 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-10-21 08:09
2013/10/21 David Edelsohn <report@bugs.python.org>:
>> I added many tests when I added this feature to Python: see test_signal.py. By the way, it uses select.select() :-) Does test_signal pass successfully on AIX?
>
> @haypo: test_signal skips some tests on AIX.  See Issue #18238 .

Ah yes, the test on select+signal is skipped, but the basic feature
(writing a signal number in a pipe using wakeup fd) works on AIX.
msg200720 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-21 08:37
I'm completely lost: how many issues is this thread conflating?
For example, David, I think you said that my patch regarding signal
handling did fix some tests.
Is that still the case?
If yes, could you open a specific issue for the remaining failing tests on AIX?

AFAICT, the original hang is due to AIX always reporting the write-end
of a pipe as read-ready: we should only keep discussing this problem
here.

The above FreeBSD and Ubuntu failures seem completely unrelated, they
should get there own issues.
msg200745 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-10-21 10:23
Would it make sense to use socketpair() instead of pipe() on AIX?  We could check for the "bug" directly rather than checking specifically for AIX.
msg200791 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-21 14:50
OK, I've created issue 19334 to track test_asyncio hangs on other platforms. Let's please keep this issue restricted to AIX. I am aware of the following issue:

- select for read on the write end of a pipe returns immediately

Any other AIX issues also please open separate issues (and assign to me to get my immediate attention).

I wonder if we can at least dynamically detect that this does not work in the _read_ready() handler or even in the setup? ISTM that reading from the write end should immediately fail with EBADF. (It does on my system, but I don't want to generalize. :-)
msg200794 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-21 14:55
@Richard: socketpair() seems to work fine on AIX (we use it internally). Maybe we can always use that for subprocess management? But will the subprocess module support this?
msg200806 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-10-21 16:19
The following uses socketpair() instead of pipe() for stdin, and works for me on Linux:

diff -r 7d94e4a68b91 asyncio/unix_events.py
--- a/asyncio/unix_events.py    Sun Oct 20 20:25:04 2013 -0700
+++ b/asyncio/unix_events.py    Mon Oct 21 17:15:19 2013 +0100
@@ -272,8 +272,6 @@
         self._loop = loop
         self._pipe = pipe
         self._fileno = pipe.fileno()
-        if not stat.S_ISFIFO(os.fstat(self._fileno).st_mode):
-            raise ValueError("Pipe transport is for pipes only.")
         _set_nonblocking(self._fileno)
         self._protocol = protocol
         self._buffer = []
@@ -442,9 +440,16 @@
         self._finished = False
         self._returncode = None

+        if stdin == subprocess.PIPE:
+            stdin_w, stdin_r = socket.socketpair()
+        else:
+            stdin_w = stdin_r = None
         self._proc = subprocess.Popen(
-            args, shell=shell, stdin=stdin, stdout=stdout, stderr=stderr,
+            args, shell=shell, stdin=stdin_r, stdout=stdout, stderr=stderr,
             universal_newlines=False, bufsize=bufsize, **kwargs)
+        if stdin_r is not None:
+            stdin_r.close()
+            self._proc.stdin = open(stdin_w.detach(), 'rb', buffering=bufsize)
         self._extra['subprocess'] = self._proc

     def close(self):
msg200808 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-21 16:28
Brilliant. It works for me too on OS X.

David, can you check this on AIX?
msg200811 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-21 16:34
Attached is a version of Richard's patch that I like better (sockpair.diff).

- instead of dropping the ValueError, check for ISFIFO and ISSOCK
- pass the read end of the socket pair to Popen()
msg200814 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-21 17:08
New, improved version of sockpair.diff.
msg200860 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-22 02:47
The testsuite is hanging in test_asyncio again and test_subprocess_close_client_stream needs to be skipped.
msg200861 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-22 02:48
Other than the skipped tests, test_asyncio now passes on AIX except for:

FAIL: test_write_pipe (test.test_asyncio.test_events.PollEventLoopTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dje/src/cpython/Lib/test/test_asyncio/test_events.py", line 875, in test_write_pipe
    self.assertEqual(b'2345', data)
AssertionError: b'2345' != b''
msg200867 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-22 03:31
Thanks; I've applied and pushed your fix for the hang.

I think it's time to commit the sockpair.diff change; the test_write_pipe() failure looks related -- as you recall the problem doesn't really cause a hang, it causes a premature closing of a write pipe, the hangs were just due to the test structure (unfortunate, really).

I'll let you know when.
msg200868 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-22 03:37
New changeset 2a0bda8d283d by Guido van Rossum in branch 'default':
Switch subprocess stdin to a socketpair, attempting to fix issue #19293 (AIX hang).
http://hg.python.org/cpython/rev/2a0bda8d283d
msg200869 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-22 03:38
Let's see how this fares. If it passes on the AIX buildbot, maybe see which of the tests we skipped on AIX can be re-enabled again?
msg200872 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-22 03:55
With the latest changeset applied to default, the test_subprocess tests no longer hang, but test_write_pipe_disconnect_on_close now hangs.

The attached patch changes test_event.py to not skip test_subprocess_* but now skips test_write_pipe_disconnect_on_close.
msg200873 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-22 03:58
I think I know what that is -- the test should use a socketpair now. I'll see how quickly I can give you a patch for that. The other one is now committed.
msg200874 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-22 04:09
David, try this fix (aixfix.diff).
msg200876 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-22 04:23
test_asyncio no longer hangs on AIX with the aixfix.diff patch. Thanks!
msg200879 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-22 04:27
Third try's a charm. Corrected aix fix.
msg200880 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-22 04:29
New changeset c2e018c54689 by Guido van Rossum in branch 'default':
Fix asyncio issue #19293 (hangs on AIX).
http://hg.python.org/cpython/rev/c2e018c54689
msg200881 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-10-22 04:30
"And there was much rejoicing."
msg200882 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2013-10-22 04:31
You have to try harder to break it again. aixfix2.diff still works. ;-)
History
Date User Action Args
2022-04-11 14:57:52adminsetgithub: 63492
2013-10-22 04:31:54David.Edelsohnsetmessages: + msg200882
2013-10-22 04:30:21gvanrossumsetstatus: open -> closed
resolution: fixed
messages: + msg200881

stage: resolved
2013-10-22 04:29:19python-devsetmessages: + msg200880
2013-10-22 04:27:17gvanrossumsetfiles: - aixfix.diff
2013-10-22 04:27:01gvanrossumsetfiles: + aixfix2.diff

messages: + msg200879
2013-10-22 04:25:47gvanrossumsetmessages: - msg200878
2013-10-22 04:24:21gvanrossumsetmessages: + msg200878
2013-10-22 04:23:40gvanrossumsetmessages: - msg200877
2013-10-22 04:23:02gvanrossumsetmessages: + msg200877
2013-10-22 04:23:01David.Edelsohnsetmessages: + msg200876
2013-10-22 04:09:16gvanrossumsetfiles: + aixfix.diff

messages: + msg200874
2013-10-22 03:58:42gvanrossumsetmessages: + msg200873
2013-10-22 03:55:12David.Edelsohnsetfiles: + issue19293.patch3

messages: + msg200872
2013-10-22 03:38:53gvanrossumsetmessages: + msg200869
2013-10-22 03:37:23python-devsetmessages: + msg200868
2013-10-22 03:31:45gvanrossumsetmessages: + msg200867
2013-10-22 02:48:00David.Edelsohnsetmessages: + msg200861
2013-10-22 02:47:20David.Edelsohnsetfiles: + issue19293.patch2

messages: + msg200860
2013-10-21 17:08:17gvanrossumsetfiles: - sockpair.diff
2013-10-21 17:08:04gvanrossumsetfiles: + sockpair.diff

messages: + msg200814
2013-10-21 16:34:47gvanrossumsetfiles: + sockpair.diff

messages: + msg200811
2013-10-21 16:28:23gvanrossumsetmessages: + msg200808
2013-10-21 16:19:05sbtsetmessages: + msg200806
2013-10-21 15:09:12floxsetnosy: + flox
2013-10-21 14:55:34gvanrossumsetmessages: + msg200794
2013-10-21 14:50:52gvanrossumsetmessages: + msg200791
title: test_asyncio hanging for 1 hour -> test_asyncio hanging for 1 hour (AIX version, hangs in test_subprocess_interactive)
2013-10-21 10:23:03sbtsetmessages: + msg200745
2013-10-21 08:37:01neologixsetmessages: + msg200720
2013-10-21 08:09:05vstinnersetmessages: + msg200713
2013-10-21 03:06:05David.Edelsohnsetmessages: + msg200689
2013-10-21 00:33:29tim.peterssetnosy: + tim.peters
messages: + msg200672
2013-10-21 00:28:37vstinnersetmessages: + msg200671
2013-10-21 00:19:32vstinnersetnosy: + vstinner
messages: + msg200666
2013-10-20 19:01:42gvanrossumsetmessages: + msg200615
2013-10-20 17:37:10sbtsetnosy: + sbt
messages: + msg200599
2013-10-20 17:36:37neologixsetmessages: + msg200598
2013-10-20 16:55:54gvanrossumsetmessages: + msg200597
2013-10-20 14:08:36David.Edelsohnsetmessages: + msg200589
2013-10-20 13:44:20David.Edelsohnsetmessages: + msg200582
2013-10-20 08:52:08neologixsetmessages: + msg200553
2013-10-20 08:42:09gvanrossumsetmessages: + msg200552
2013-10-20 08:26:43neologixsetmessages: + msg200551
2013-10-20 06:37:40gvanrossumsetmessages: + msg200547
2013-10-20 04:30:40gvanrossumsetmessages: + msg200544
2013-10-20 04:20:58db3lsetmessages: + msg200543
2013-10-20 03:56:17gvanrossumsetmessages: + msg200541
2013-10-20 02:45:40David.Edelsohnsetmessages: + msg200534
2013-10-20 01:40:52pitrousetmessages: + msg200533
2013-10-20 01:35:59David.Edelsohnsetmessages: + msg200532
2013-10-20 01:31:12David.Edelsohnsetmessages: + msg200531
2013-10-20 01:29:55pitrousetnosy: + ncoghlan
messages: + msg200530
2013-10-19 23:55:54gvanrossumsetmessages: + msg200525
2013-10-19 23:43:55pitrousetnosy: + db3l
messages: + msg200521
2013-10-19 23:42:31gvanrossumsetmessages: + msg200520
2013-10-19 23:34:07neologixsetmessages: + msg200518
2013-10-19 23:26:56gvanrossumsetmessages: + msg200516
2013-10-19 23:14:20larrysetnosy: + larry
messages: + msg200514
2013-10-19 23:13:51pitrousetmessages: + msg200513
2013-10-19 22:50:22gvanrossumsetmessages: + msg200510
2013-10-19 22:09:28neologixsetmessages: + msg200505
2013-10-19 20:35:54David.Edelsohnsetmessages: + msg200494
2013-10-19 19:53:57neologixsetfiles: + asyncio_signal.diff

messages: + msg200487
2013-10-19 18:44:36pitrousetnosy: + pitrou
messages: + msg200475
2013-10-19 18:24:17neologixsetmessages: + msg200470
2013-10-19 18:17:26David.Edelsohnsetmessages: + msg200466
2013-10-19 18:11:35gvanrossumsetmessages: + msg200463
2013-10-19 18:09:35David.Edelsohnsetmessages: + msg200461
2013-10-19 18:03:28neologixsetmessages: + msg200456
2013-10-19 17:47:40David.Edelsohnsetmessages: + msg200452
2013-10-19 17:36:21David.Edelsohnsetmessages: + msg200449
2013-10-19 16:41:06gvanrossumsetassignee: gvanrossum
2013-10-19 16:40:25gvanrossumsetmessages: + msg200440
2013-10-19 16:31:06skrahsettitle: test_asyncio failures on AIX -> test_asyncio hanging for 1 hour
2013-10-19 16:29:42skrahlinkissue19285 superseder
2013-10-19 16:26:33skrahsetnosy: + skrah
messages: + msg200438
2013-10-19 16:25:38neologixsetnosy: + neologix
messages: + msg200437
2013-10-19 16:18:25gvanrossumsetmessages: + msg200435
2013-10-19 16:10:46python-devsetnosy: + python-dev
messages: + msg200433
2013-10-19 15:40:45David.Edelsohnsetfiles: + issue19293.patch
keywords: + patch
messages: + msg200426

components: + Tests
2013-10-19 05:12:28gvanrossumsetmessages: + msg200374
2013-10-19 05:07:22gvanrossumsetnosy: + gvanrossum
messages: + msg200372
2013-10-19 04:20:34David.Edelsohncreate