classification
Title: You cannot call communicate() safely after receiving an exception (EINTR or EAGAIN)
Type: Stage: needs patch
Components: IO Versions: Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: amrith, gps, pitrou, r.david.murray, vstinner
Priority: normal Keywords:

Created on 2014-07-31 11:58 by amrith, last changed 2014-08-05 16:40 by amrith. This issue is now closed.

Messages (13)
msg224398 - (view) Author: Amrith Kumar (amrith) Date: 2014-07-31 11:58
Environment:
- Linux (Ubuntu 14.04 LTS, x64) running Python 2.7.
- Code uses eventlet.green.subprocess

Code establishes subprocess as:

   subprocess.Popen(command,
                    stdin=subprocess.PIPE,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.PIPE, ...)

It then calls communicate()

communicate() throws EAGAIN

The code calls communicate() again

This fails with "ValueError: I/O operation on closed file"

This exception comes from eventlet/greenio.py because an operation (flush) has been attempted on a closed file.

The complete explanation of this failure is this.

When communicate() is called in this way (WITH NO INPUT but with all three handles stdin, stdout, stderr), code in communicate() bypasses the "1 handle optimization" and goes directly to _communicate()

_communicate() will first send any input on stdin along, and flush and close stdin(). 

From that point forward, if anything to do with stderr and stdout returns EAGAIN, an attempt to call communicate() again will fail because stdin has now been closed().

This is because the code in _communicate() preemptively closes stdin if there is no input but does not reset stdin.

       def _communicate(self, input):
            if self.stdin:
                # Flush stdio buffer.  This might block, if the user has
                # been writing to .stdin in an uncontrolled fashion.
                self.stdin.flush()
                if not input:
                    self.stdin.close()

The fix for this is relatively straightforward (conceptually).

       def _communicate(self, input):
            if self.stdin:
                # Flush stdio buffer.  This might block, if the user has
                # been writing to .stdin in an uncontrolled fashion.
                self.stdin.flush()
                if not input:
                    self.stdin.close()
                    self.stdin = None # This should take care of it.

However, a partial workaround from the client perspective is this.

1. If you have no input, set stdin to None
2. If you do have input and you get EAGAIN, you cannot safely retry because your input may have already been completely or partially sent to the subprocess; you have to assume failure.

--------------------
Backtrace:

Traceback (most recent call last):
	File "/opt/stack/trove/trove/tests/unittests/guestagent/test_mongodb_manager.py", line 58, in test_prepare_from_backup
		self._prepare_dynamic(backup_id='backup_id_123abc')
	File "/opt/stack/trove/trove/tests/unittests/guestagent/test_mongodb_manager.py", line 91, in _prepare_dynamic
		backup_info=backup_info)
	File "trove/guestagent/datastore/mongodb/manager.py", line 66, in prepare
		operating_system.update_owner('mongodb', 'mongodb', mount_point)
	File "trove/guestagent/common/operating_system.py", line 109, in update_owner
		run_as_root=True, root_helper="sudo")
	File "trove/common/utils.py", line 278, in execute_with_timeout
		return execute(*args, **kwargs)
	File "trove/openstack/common/processutils.py", line 186, in execute
		result = obj.communicate()
	File "/usr/lib/python2.7/subprocess.py", line 799, in communicate
		return self._communicate(input)
	File "/usr/lib/python2.7/subprocess.py", line 1396, in _communicate
		self.stdin.flush()
	File "/opt/stack/trove/.tox/py27/local/lib/python2.7/site-packages/eventlet/greenio.py", line 419, in _operationOnClosedFile
		raise ValueError("I/O operation on closed file")

ValueError: I/O operation on closed file
msg224399 - (view) Author: Amrith Kumar (amrith) Date: 2014-07-31 12:22
I'll supply a patch for this proposed change.
msg224405 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-07-31 14:27
I believe our recent thinking in recent times has been to handle EAGAIN in the higher level interfaces such as subprocess, not expose them to the caller.  Does python3 have the same problem?

I isn't clear that you have provided enough information to reproduce this.  Any chance you could reproduce it without using eventlet?
msg224411 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-07-31 15:10
This sounds like an issue specific to eventlet. There's no reason for the standard communicate() call to raise EAGAIN, since the created pipes are blocking.

That said, if the fix is that trivial, I wouldn't be opposed to see it enter the stdlib (assuming it doesn't break anything).
msg224412 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-31 15:19
Instead of just catching EAGAIN, you can catch BlockingIOError. In Python 2.7, it means catching: (EAGAIN, EALREADY, EINPROGRESS, EWOULDBLOCK) errors.

I made a similar change in asyncore recently:
---
changeset:   91833:d422062d7d36
branch:      2.7
parent:      91830:841cdb6145e9
user:        Victor Stinner <victor.stinner@gmail.com>
date:        Thu Jul 24 19:15:00 2014 +0200
files:       Doc/library/asyncore.rst Lib/asynchat.py Lib/test/test_asynchat.py Misc/NEWS
description:
Issue #16133: The asynchat.async_chat.handle_read() method now ignores
socket.error() exceptions with blocking I/O errors: EAGAIN, EALREADY,
EINPROGRESS, or EWOULDBLOCK. Initial patch written by Xavier de Gaye.
---

This issue is another reason to not use eventlet with monkey-patching :-) It's time to switch to asyncio (trollius for Python 2)!
msg224419 - (view) Author: Amrith Kumar (amrith) Date: 2014-07-31 15:36
I see three comments, one from r.david.murray, one from haypo and one from pitrou. I'll try and address all three.

r.david.murray:

The code in question is in https://github.com/openstack/oslo-incubator/blob/master/openstack/common/processutils.py#L177-L189

note that we're catching EAGAIN and EINTR.

I have not been able to isolate this down to a simple repro without the rest of this paraphernalia but I'm trying.

So, we are 'catching' EAGAIN or EINTR here and we're trying to handle it to the best of our ability. However, if the underlying layer is not setup to handle a retry, our best efforts will be fruitless.

That is what is happening here. 

The reason for this code (ignoring the retry of 20) was put in place exactly because a call to communicate() received an EAGAIN.

The issue therefore is that in order for the higher level to properly handle this, communicate() should be setup to handle a second call, which it currently is not. 

haypo and pitrou: that may be true; I'm not competent to comment on that.

But, as pointed out in earlier comment (and modulo this may be eventlet specific), just catching more exceptions isn't the answer.

if the descriptor is closed, the thing that communicate/_communicate() call should be able to handle that situation. And this bug illustrates that at least eventlet doesn't handle that.

However, I submit to you that this is NOT an eventlet issue. Here's why.

The failure here is that a flush call is being attempted on a closed descriptor. I believe that the implementation of flush (in eventlet) is legitimately throwing an exception indicating that the state machine was violated (cannot flush on closed descriptor). 

The close() was invoked by subprocess.py after it finished doing what it thought it had to do with stdin on the first invocation. therefore I believe it must be the responsibility of subprocess.py to make sure that when invoked again, it doesn't step on itself.

Either that, or subprocess.py's communicate() implementation should indicate that it can only be called once, capture all exceptions that would point a user to retry (such as EAGAIN and EINTR) and mask them and return some EFATAL.
msg224421 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-31 15:46
> However, I submit to you that this is NOT an eventlet issue. (...) capture all exceptions that would point a user to retry (such as EAGAIN and EINTR) and mask them and return some EFATAL.

You cannot get EAGAIN on a blocking file descriptor and subprocess
uses blocking file descriptors. It looks like eventlet monkey patching
makes file descriptors non-blocking.
msg224423 - (view) Author: Amrith Kumar (amrith) Date: 2014-07-31 15:58
haypo; no argument from me. I just updated the description to make it more exact.

I may have received EINTR or EAGAIN.

In either event, I need to debug some more and try and do a clean repro and then figure out a proper fix.

In the meanwhile, I proposed a fix for the openstack code; if you are so inclined, please take a look.

https://review.openstack.org/#/c/110933/
msg224425 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-07-31 16:11
By the definition of the (blocking) communicate API, communicate can only be called once (it feeds in all the input, and reads all the output).
msg224426 - (view) Author: Amrith Kumar (amrith) Date: 2014-07-31 16:16
r.david.murray, I guess the issue we have is that something (hypothesis: eventlet) is causing the blocking API to communicate() to throw an EAGAIN.

This (https://bugs.launchpad.net/nova/+bug/1271331) is the bug that was fixed a while ago in OpenStack that introduced the notion of calling communicate() multiple times.
msg224531 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-08-02 00:24
> By the definition of the (blocking) communicate API, communicate can only be called once (it feeds in all the input, and reads all the output).

Since Python 3.3, communicate() can now fail with TimeoutExpired. On UNIX, it looks like it's possible to call again communicate() in this case. On Windows, the implementation is different: stdin.write() doesn't use the timeout, so the write can hang longer than the timeout :-(

Use asyncio is you want reliable and efficient communicate() on all platforms ;-)
msg224603 - (view) Author: Amrith Kumar (amrith) Date: 2014-08-03 01:21
After some debugging and reading code in python's subprocess.py (v2.7), here's what I'm seeing.

(a) the error has nothing to do with eventlet and monkey-patching.
(b) the code in _communicate_with_select() and potentially _communicate_with_poll() are the problem.

What's the problem?
-------------------

The code in _communicate_with_select() correctly sets up to handle the read and write calls without blocking on any of them. It does this by establishing the two (read and write) lists of descriptors and calls select without no timeout specified. 

When select returns, and indicates that a socket is in (for example) the read list, what that means is that an attempt to read() will not block. However, it is possible on some systems (Linux for sure) that 

(a) a socket is non-blocking, and
(b) a call to select indicates that the socket is ready to read, and
(c) a call to read the socket returns an error EAGAIN (aka EWOULDBLOCK).

Callers of read() and write() on non-blocking sockets should be prepared to handle this situation. 

The python code in _communicate_with_select() is not.

It assumes that if select() returns that a read fd is ready for read, that a call to read it will produce 0 or more bytes. The calls to read() for stdout and stderr are not guarded with exception handlers. However, if a socket is setup as non-blocking, any call can produce EWOULDBLOCK, EAGAIN, ...

Adding some debugging code it was possible to recreate the problem and show that the backtrace was (in this case it happened with python 2.6)

Traceback (most recent call last):
[...]
  File "trove/openstack/common/processutils.py", line 186, in execute
    result = obj.communicate()
  File "/usr/lib64/python2.6/subprocess.py", line 732, in communicate
    stdout, stderr = self._communicate(input, endtime)
  File "/usr/lib64/python2.6/subprocess.py", line 1318, in _communicate
    stdout, stderr = self._communicate_with_select(input, endtime)
  File "/usr/lib64/python2.6/subprocess.py", line 1483, in _communicate_with_select
    data = os.read(self.stdout.fileno(), 1024)
OSError: [Errno 11] Resource temporarily unavailable

The correct fix for this is to make _communicate_with_select() and maybe _communicate_with_poll() properly handle the read() and write() calls and be better prepared to handle a thrown condition of EAGAIN from os.read or os.write.
msg224857 - (view) Author: Amrith Kumar (amrith) Date: 2014-08-05 16:40
The issue turned out to be something funky in monkey patching.

os.read() shouldn't be getting called if the monkey patch was correct in the first place.
History
Date User Action Args
2014-08-05 16:40:16amrithsetstatus: open -> closed
resolution: not a bug
messages: + msg224857
2014-08-03 01:21:37amrithsetmessages: + msg224603
2014-08-02 00:24:27vstinnersetmessages: + msg224531
2014-07-31 16:16:22amrithsetmessages: + msg224426
2014-07-31 16:11:18r.david.murraysetmessages: + msg224425
2014-07-31 15:58:47amrithsetmessages: + msg224423
2014-07-31 15:56:32amrithsettitle: You cannot call communicate() safely after receiving EAGAIN -> You cannot call communicate() safely after receiving an exception (EINTR or EAGAIN)
2014-07-31 15:46:34vstinnersetmessages: + msg224421
2014-07-31 15:36:07amrithsetmessages: + msg224419
components: + IO, - Library (Lib)
2014-07-31 15:19:10vstinnersetmessages: + msg224412
2014-07-31 15:10:09pitrousetnosy: + pitrou
messages: + msg224411
2014-07-31 14:27:22r.david.murraysetnosy: + r.david.murray, gps
messages: + msg224405

components: + Library (Lib), - IO
stage: needs patch
2014-07-31 12:24:53vstinnersetnosy: + vstinner
2014-07-31 12:22:26amrithsetmessages: + msg224399
2014-07-31 11:58:59amrithcreate