classification
Title: subprocess.Popen can't read file object as stdin after seek
Type: enhancement Stage:
Components: Documentation Versions: Python 3.2
process
Status: closed Resolution: later
Dependencies: Superseder:
Assigned To: astrand Nosy List: ajaksu2, astrand, gazzadee, ldeller, terry.reedy
Priority: low Keywords:

Created on 2006-08-25 05:52 by gazzadee, last changed 2010-08-07 21:10 by terry.reedy. This issue is now closed.

Files
File name Uploaded Description
hello.txt gazzadee, 2006-08-25 05:52 text file for demo scripts
subprocess_seek_once.py gazzadee, 2006-08-25 05:54 shows Popen working properly after one seek
subprocess_seek_error.py gazzadee, 2006-08-25 05:55 shows Popen not reading stdin after multiple seeks
Messages (6)
msg29680 - Author: GaryD (gazzadee) Date: 2006-08-25 05:52
When I use an existing file object as stdin for a call
to subprocess.Popen, then Popen cannot read the file if
I have called seek on it more than once.

For example, in the following Python code:

>>> import subprocess
>>> rawfile = file('hello.txt', 'rb')
>>> rawfile.readline()
'line 1\n'
>>> rawfile.seek(0)
>>> rawfile.readline()
'line 1\n'
>>> rawfile.seek(0)
>>> process_object = subprocess.Popen(["cat"],
...     stdin=rawfile, stdout=subprocess.PIPE,
...     stderr=subprocess.PIPE)

process_object.stdout now contains nothing, implying
that nothing was on process_object.stdin.

Note that this only applies to a non-trivial seek (i.e.
one where the file pointer actually changes). Calling
seek(0) multiple times in a row does not change
anything (obviously).

I have not investigated whether this reveals a problem
with seek not changing the underlying file descriptor,
or a problem with Popen not handling the file
descriptor properly.

I have attached some complete python scripts that
demonstrate this problem. One shows cat working after
calling seek once, the other shows cat failing after
calling seek twice.

Python version being used:
Python 2.4.2 (#1, Nov  3 2005, 12:41:57)
[GCC 3.4.3-20050110 (Gentoo Linux 3.4.3.20050110,
ssp-3.4.3.20050110-0, pie-8.7 on linux2
msg29681 - Author: lplatypus (ldeller) Date: 2006-08-25 07:13
I found the cause of this bug:

A libc FILE* (used by python file objects) may hold a
different file offset than the underlying OS file
descriptor.  The posix version of Popen._get_handles does
not take this into account, resulting in this bug.
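The mismatch described above can be observed directly in current Python 3, where a buffered file object plays the role of the libc FILE*. The following is a minimal sketch, not code from this issue: the helper names `make_demo_file` and `offsets_after_readline` are hypothetical, and it assumes the demo file is smaller than the default read buffer, so one readline() pulls the whole file into user space.

```python
import os
import tempfile


def make_demo_file():
    """Create a small three-line file (hypothetical stand-in for hello.txt)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"line 1\nline 2\nline 3\n")
    return path


def offsets_after_readline(path):
    """Return (file-object offset, OS descriptor offset) after one readline()."""
    with open(path, "rb") as f:
        f.readline()
        logical = f.tell()                          # position the file object reports
        raw = os.lseek(f.fileno(), 0, os.SEEK_CUR)  # position of the descriptor itself
        return logical, raw
```

For the demo file the file object reports offset 7 (just past the first line), while the descriptor already sits at offset 21, the end of the file. A child process that inherits the descriptor starts reading from the descriptor's offset, not the file object's.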

The following patch against svn trunk fixes the problem.  I
don't have permission to attach files to this item, so I'll
have to paste the patch here:

Index: subprocess.py
===================================================================
--- subprocess.py       (revision 51581)
+++ subprocess.py       (working copy)
@@ -907,6 +907,12 @@
             else:
                 # Assuming file-like object
                 p2cread = stdin.fileno()
+                # OS file descriptor's file offset does not necessarily match
+                # the file offset in the file-like object, so do an lseek:
+                try:
+                    os.lseek(p2cread, stdin.tell(), 0)
+                except OSError:
+                    pass # file descriptor does not support seek/tell

             if stdout is None:
                 pass
@@ -917,6 +923,12 @@
             else:
                 # Assuming file-like object
                 c2pwrite = stdout.fileno()
+                # OS file descriptor's file offset does not necessarily match
+                # the file offset in the file-like object, so do an lseek:
+                try:
+                    os.lseek(c2pwrite, stdout.tell(), 0)
+                except OSError:
+                    pass # file descriptor does not support seek/tell

             if stderr is None:
                 pass
@@ -929,6 +941,12 @@
             else:
                 # Assuming file-like object
                 errwrite = stderr.fileno()
+                # OS file descriptor's file offset does not necessarily match
+                # the file offset in the file-like object, so do an lseek:
+                try:
+                    os.lseek(errwrite, stderr.tell(), 0)
+                except OSError:
+                    pass # file descriptor does not support seek/tell

             return (p2cread, p2cwrite,
                     c2pread, c2pwrite,
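
An application can apply the same idea as the patch itself, without changing subprocess. The following is a minimal Python 3 sketch for an input file; the helper name `sync_descriptor_offset` is hypothetical and not part of subprocess or of the patch.

```python
import os


def sync_descriptor_offset(fileobj):
    """Hypothetical helper: move the OS descriptor to fileobj.tell().

    Returns True if the descriptor was repositioned, False if the
    underlying file does not support seek/tell (for example, a pipe).
    """
    try:
        os.lseek(fileobj.fileno(), fileobj.tell(), os.SEEK_SET)
    except OSError:
        return False
    return True
```

A caller would run `sync_descriptor_offset(rawfile)` immediately before passing `rawfile` as stdin to Popen. This sketch assumes the parent does not read from the file object again afterwards, because the object's own buffer may be out of step with the descriptor once it has been moved.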
msg29682 - Author: Peter Åstrand (astrand) * (Python committer) Date: 2007-01-21 19:43
It's not obvious that the subprocess module is doing anything wrong here. Mixing streams and file descriptors is always problematic and is best avoided (http://ftp.gnu.org/gnu/Manuals/glibc-2.2.3/html_node/libc_232.html). However, the subprocess module *does* accept a file object (based on a libc stream), for convenience. For things to work correctly, the application and the subprocess module need to cooperate. I admit that the documentation needs improvement on this topic, though.

It's quite easy to demonstrate the problem; you don't need to use seek at all. Here's a simple test case:

import subprocess
rawfile = file('hello.txt', 'rb')
rawfile.readline()
p = subprocess.Popen(["cat"], stdin=rawfile, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print "File contents from Popen() call to cat:"
print p.stdout.read()
p.wait()

The descriptor offset is at the end, since the stream buffers. http://ftp.gnu.org/gnu/Manuals/glibc-2.2.3/html_node/libc_233.html describes the need for "cleaning up" a stream when you switch from stream functions to descriptor functions. This is described at http://ftp.gnu.org/gnu/Manuals/glibc-2.2.3/html_node/libc_235.html#SEC244. The documentation recommends the fclean() function, but it's only available on GNU systems and not in Python. As I understand it, fflush() works well for cleaning an output stream.
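
For the output case, the Python-level counterpart of fflush() is the file object's flush() method. The following is a minimal Python 3 sketch, assuming the application wants its own header to land in the file before the child's output; the function name `write_header_then_child` and the use of `sys.executable` as the child command are illustrative choices, not anything from this issue.

```python
import subprocess
import sys


def write_header_then_child(path, header, child_text):
    """Write header from the parent, then let a child append child_text."""
    child_code = "import sys; sys.stdout.write(%r)" % child_text
    with open(path, "w") as out:
        out.write(header)
        out.flush()  # hand buffered bytes to the descriptor before the child writes
        subprocess.run([sys.executable, "-c", child_code], stdout=out, check=True)
    with open(path) as f:
        return f.read()
```

Without the flush() call, the header would still be sitting in the parent's buffer while the child writes, so the child's text would end up ahead of it in the file.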

For input streams, however, things are difficult. fflush() might work sometimes, but to be sure, you must set the file pointer as well. And this does not work for files that are not random access, since there's no way to move the buffered data back to the operating system.

So, since subprocess cannot reliably deal with this situation, I believe it shouldn't try. I think it makes more sense that the application prepares the file object for low-level operations. There are many other Python modules that use the .fileno() method, for example the select module, and as far as I understand, this module doesn't try to clean streams or anything like that.
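
One way an application might prepare a file for descriptor-level use in current Python 3 is to avoid the user-space buffer altogether. This is a sketch under that assumption, not something proposed in this issue; `open_unbuffered` is a hypothetical helper, and `buffering=0` is only permitted for binary mode.

```python
import os


def open_unbuffered(path):
    """Hypothetical helper: open path for binary reading with no user-space buffer."""
    # With buffering=0, open() returns a raw file object, so every read goes
    # straight to the descriptor and the two offsets cannot drift apart.
    return open(path, "rb", buffering=0)
```

After a readline() on such an object, `f.tell()` and `os.lseek(f.fileno(), 0, os.SEEK_CUR)` report the same position, so a child that inherits the descriptor continues where the parent stopped. The trade-off is that unbuffered reads are slower.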

To summarize: I'm leaning towards a documentation solution. 
msg29683 - Author: lplatypus (ldeller) Date: 2007-01-22 01:23
Fair enough, that's probably cleaner and more efficient than playing games with fflush and lseek anyway.  If file objects are not supported properly then maybe they shouldn't be accepted at all, forcing the application to call fileno() if that's what is wanted.  That might break a lot of existing code though.  Then again it may be beneficial to get everyone to review code which passes file objects to Popen in light of this behaviour.
msg84478 - Author: Daniel Diniz (ajaksu2) Date: 2009-03-30 03:23
Not a bug, leaving open for the doc RFE (but suggest closing anyway).
msg113204 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-08-07 21:10
In the absence of a doc patch, I am following Daniel's suggestion to close this.
History
Date User Action Args
2010-08-07 21:10:32 terry.reedy set status: open -> closed
versions: + Python 3.2, - Python 2.6
nosy: + terry.reedy
messages: + msg113204
resolution: later
2009-03-30 03:23:06 ajaksu2 set priority: normal -> low
type: enhancement
components: + Documentation, - Library (Lib)
versions: + Python 2.6, - Python 2.4
nosy: + ajaksu2
messages: + msg84478
2006-08-25 05:52:12 gazzadee create