classification
Title: fileinput requires two EOF when reading stdin
Type: behavior
Stage: needs patch
Components: Documentation, Library (Lib)
Versions: Python 3.3, Python 3.2, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Arfrever, docs@python, flox, jason.coombs, jgeralnik, kristjan.jonsson, pitrou, r.david.murray, serhiy.storchaka, zach.ware
Priority: normal
Keywords: patch

Created on 2012-06-14 15:19 by jason.coombs, last changed 2012-06-19 09:14 by kristjan.jonsson.

Files
File name Uploaded Description
fileinput.patch jgeralnik, 2012-06-15 14:20
Messages (23)
msg162798 - Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2012-06-14 15:19
I found that fileinput.input() requires two EOF characters to stop reading input on Python 2.7.3 on Windows and Ubuntu:

PS C:\Users\jaraco> python
Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
>>> import fileinput
>>> lines = list(fileinput.input())
foo
bar
^Z
^Z
>>> lines
['foo\n', 'bar\n']

I don't see anything in the documentation that suggests that two EOF characters would be required, and I can't think of any reason why that should be the case.
msg162799 - Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2012-06-14 15:23
I observed that if I send EOF as the first character, it is honored immediately and a second EOF is not required.
msg162802 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-06-14 16:07
Frankly I'm surprised it works at all, since fileinput.input() will by default read from stdin, and stdin is in turn being read by the python prompt.

I just checked 2.5 on linux, and the same situation exists there (two ^Ds are required to end the input()).  I suspect we'll find the explanation in the interaction between the default behavior of fileinput.input() and the interactive prompt.
msg162803 - Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2012-06-14 16:10
FWIW, I encountered the double-EOF behavior when invoking fileinput.input from a script running non-interactively (except of course for the input() call).
msg162808 - Author: Zachary Ware (zach.ware) * Date: 2012-06-14 17:14
I just tested on Python 3.2 and found something interesting: a ^Z character on a line that has other input is read in as a literal character. Also, if other input follows an EOF on its own line, you still have to send two more EOFs to end.

Python 3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import fileinput
>>> lines = list(fileinput.input())
test
testing
^Z
^Z
>>> lines
['test\n', 'testing\n']
>>> lines = list(fileinput.input())
test
testing^Z
^Z
^Z
>>> lines
['test\n', 'testing\x1a\n']
>>> lines = list(fileinput.input())
testing^Z
test
^Z
testing
^Z
^Z
>>> lines
['testing\x1a\n', 'test\n', 'testing\n']

Also, the documentation for fileinput doesn't mention EOF at all.
msg162809 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-06-14 17:32
I don't know how the EOF character works, but I wouldn't be surprised if it had to be on a line by itself to mean EOF.

If the double EOF is required when not at the interactive prompt, then there could be a long standing bug in fileinput's logic where it is doing another read after the last file is closed.  Normally this wouldn't even be visible since it would just get EOF again, but when the file is an interactive STDIN, closing it doesn't really close it...
msg162815 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-14 19:09
This is not specific to fileinput. The same effect can be reproduced with simple idiomatic code:

import sys
while True:
    chunk = sys.stdin.read(1000)
    if not chunk:
        break
    # process
msg162817 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-06-14 19:35
That makes sense.  It is a consequence of (a) buffered input and (b) the fact that EOF on stdin doesn't really close it.  (And by interactive here I don't just mean Python's interactive prompt, but also the shell).

By default fileinput uses readlines with a buffer size, so it suffers from the same issue.  It is only the second time that you close stdin that it gets an empty buffer, and so terminates.

Anyone want to try to come up with a doc footnote to explain this?
msg162820 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-14 20:02
Note that in the rare cases where the input ends exactly on the boundary of the read buffer, a single EOF is sufficient. In particular, one EOF is always sufficient for read(1), and for read(2) it is sufficient in about half of the cases.
msg162821 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-06-14 20:18
It is unlikely to be solvable at the Python level. Witness the raw stream's behaviour (in Python 3):

>>> sys.stdin.buffer.raw.read(1000)

If you type a letter followed by ^D (Linux) or ^Z (Windows), this returns immediately:

>>> sys.stdin.buffer.raw.read(1000)
x^Db'x'

But since the result is non-empty, the buffering layer will not detect the EOF and will call read() on the raw stream again (as the 1000 bytes are not satisfied). To signal EOF to the buffered stream, you have to type ^D or ^Z *without preceding it with another character*. Try the following:

>>> sys.stdin.buffer.read(1000)

You'll see that as long as you type a letter before ^D or ^Z, the read() will not return (until you type more than 1000 characters, that is):
- ^D alone: returns!
- a letter followed by ^D: doesn't return
- a letter followed by ^D followed by ^D: returns!
- a letter followed by ^D followed by a letter followed by ^D: doesn't return

This is all caused by the fact that a C read() on stdin doesn't return until it sees either the end of a line or EOF (or until the requested number of bytes is satisfied). Just experiment with:

>>> os.read(0, 1000)

That's why I say this is not solvable at the Python level (except perhaps with bizarre ioctl hackery).
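
The behaviour described above can be imitated without a terminal. The following is only a sketch: ScriptedTerminal is a hypothetical stand-in (not part of any library) that replays what successive raw reads would return, with each b"" item standing for one EOF keypress at the start of a line. It is wrapped in io.BufferedReader and read with the loop from earlier in the thread. It assumes the Python 3 io layers and does not model the text layer that sys.stdin adds on top.

```python
import io


class ScriptedTerminal(io.RawIOBase):
    """Hypothetical stand-in for a terminal: replays scripted raw reads.

    Each non-empty item is what one raw read would return; each b"" item
    is one EOF keypress (^D or ^Z at the start of a line).
    """

    def __init__(self, events):
        super().__init__()
        self._events = list(events)
        self.eofs_consumed = 0

    def readable(self):
        return True

    def readinto(self, buf):
        if not self._events:
            # A real terminal would block here, waiting for more typing.
            raise RuntimeError("script exhausted: a real read would block")
        data = self._events.pop(0)
        if not data:
            # EOF keypress: report zero bytes once; the stream stays usable.
            self.eofs_consumed += 1
            return 0
        n = min(len(data), len(buf))
        buf[:n] = data[:n]
        if n < len(data):
            # Keep whatever did not fit for the next raw read.
            self._events.insert(0, data[n:])
        return n


def read_until_empty(stream, size):
    """The idiomatic loop from earlier in the thread."""
    chunks = []
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Two lines, then EOF twice: the first EOF only ends the first read(1000),
# which returns non-empty data, so the loop goes around again.
two_lines = ScriptedTerminal([b"foo\n", b"bar\n", b"", b""])
two_lines_data = read_until_empty(io.BufferedReader(two_lines), 1000)

# EOF as the very first input: the first read is empty, one EOF is enough.
eof_first = ScriptedTerminal([b""])
eof_first_data = read_until_empty(io.BufferedReader(eof_first), 1000)

# read(1) is always satisfied from data already received, so the raw
# stream is only asked again once the buffer is empty: one EOF suffices.
byte_at_a_time = ScriptedTerminal([b"ab\n", b""])
byte_at_a_time_data = read_until_empty(io.BufferedReader(byte_at_a_time), 1)

print(two_lines.eofs_consumed, eof_first.eofs_consumed, byte_at_a_time.eofs_consumed)
```

In this simulation the first script consumes both EOF markers, while the other two consume only one each, which matches the observations reported earlier in the thread.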
msg162903 - Author: Joey Geralnik (jgeralnik) Date: 2012-06-15 14:20
First off, I'm a complete noob looking at the python source code for the first time so forgive me if I've done something wrong.

What if the length of the chunk is checked as well? The following code works fine:

import sys
while True:
    chunk = sys.stdin.read(1000)
    if not chunk:
        break
    # process
    if len(chunk) < 1000:
        break

Something similar could be done in the fileinput class. The patch I've attached checks if the number of bytes read from the file is less than the size of the buffer (which means that the file has ended). If so, the next time the file is to be read it skips to the next file instead.

joey@j-Laptop:~/cpython$ ./python 
Python 3.3.0a3+ (default:befd56673c80+, Jun 15 2012, 17:14:12) 
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fileinput
[73732 refs]
>>> lines = list(fileinput.input())
foo
bar
^D
[73774 refs]
>>> lines
['foo\n', 'bar\n']
[73780 refs]
msg162905 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-15 14:41
>  The patch I've attached checks if the number of bytes read from the file is less than the size of the buffer (which means that the file has ended).

From the io.RawIOBase.read docs:

"""
Read up to n bytes from the object and return them. As a convenience, if
n is unspecified or -1, readall() is called. Otherwise, only one system
call is ever made. Fewer than n bytes may be returned if the operating
system call returns fewer than n bytes.

If 0 bytes are returned, and n was not 0, this indicates end of file.
"""

This caveat is not arbitrary. In particular, when reading from a
terminal with line buffering (where you can edit the line until you press
Enter), at the C level you read only one whole line at a time (if the line
is not longer than the buffer), and you receive 0 bytes only when ^D or ^Z
is pressed at the beginning of a line. The same applies to pipes and
sockets. At the Python level there are many third-party implementations of
file-like objects that rely on this behavior; you cannot rewrite all of
them.
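
The pipe case can be checked with a few lines. This is a minimal sketch using os.pipe() within a single process; the variable names are illustrative only. It shows a read returning fewer bytes than requested while the stream is still open, and zero bytes only after the writing end has been closed.

```python
import os

read_fd, write_fd = os.pipe()

os.write(write_fd, b"qwert\n")
first = os.read(read_fd, 1000)   # short read: only what has been written so far

os.write(write_fd, b"more\n")
second = os.read(read_fd, 1000)  # the stream was still open after the short read

os.close(write_fd)               # the writer goes away: this is the real end of file
third = os.read(read_fd, 1000)   # zero bytes signals EOF

os.close(read_fd)
print(first, second, third)
```

A loop that stopped as soon as it saw fewer than 1000 bytes would have missed the second write here.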
msg162906 - Author: Joey Geralnik (jgeralnik) Date: 2012-06-15 14:59
But this is calling the readlines function, which continually reads from the file until more bytes have been read than the specified argument.

From bz2.readlines:
"size can be specified to control the number of lines read: no further lines will be read once the total size of the lines read so far equals or exceeds size."

Do other file-like objects interpret this parameter differently?
msg162907 - Author: Joey Geralnik (jgeralnik) Date: 2012-06-15 15:04
Forget other filelike objects. The FileInput class only works with actual files, so the readlines function should always return at least as many bytes as its first parameter. Is this assumption wrong?
msg162908 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-06-15 15:23
fileinput should work (for some definition of work) for anything that can be opened by name using the open syscall on unix.  That includes many more things than files.  Named pipes are a particularly interesting example in this context.
msg162909 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-06-15 15:29
So the real question is: does readlines block until the byte count is satisfied?  It might, but the docs for io.IOBase.readlines leave open the possibility that fewer lines will be read, and do not limit that to the EOF case.  It's not clear, however, if that is because the non-EOF-short-read case is specifically being allowed for, or if the documenter just didn't consider that case.
msg162910 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-06-15 15:32
The _pyio.py version of readlines does read until the count is equaled or exceeded.  This could, however, be an implementation detail and not part of the spec.
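
For reference, the "equaled or exceeded" behaviour of the hint is easy to see on an in-memory stream. This sketch uses io.BytesIO, which never returns a short read, so it only illustrates how the hint is counted; it says nothing about what happens on a terminal or a pipe.

```python
import io

stream = io.BytesIO(b"a\nb\nc\nd\n")

# With a hint of 3, lines are collected until their total size reaches or
# exceeds 3 bytes: b"a\n" is 2 bytes, so a second line is read as well.
first_batch = stream.readlines(3)
second_batch = stream.readlines(3)
remaining = stream.readlines(3)   # nothing left: empty list

print(first_batch, second_batch, remaining)
```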
msg162911 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-06-15 15:40
On Friday, 15 June 2012 at 14:41 +0000, Serhiy Storchaka wrote:
> From the io.RawIOBase.read docs:
> 
> """
> Read up to n bytes from the object and return them. As a convenience, if
> n is unspecified or -1, readall() is called. Otherwise, only one system
> call is ever made. Fewer than n bytes may be returned if the operating
> system call returns fewer than n bytes.

But sys.stdin does not implement RawIOBase, it implements TextIOBase.
msg162912 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-15 15:44
> Forget other filelike objects. The FileInput class only works with actual files,

No. sys.stdin can be reassigned before using FileInput. And FileInput
has an openhook parameter (for reading compressed files or fetching files
from the Web, for example).

>  so the readlines function should always return at least as many bytes as its first parameter. Is this assumption wrong?

qwert
'qwert\n'

You type the five characters "qwert" and press <Enter>. Python immediately
receives these six characters and returns the result of
sys.stdin.readline(1000): only six characters and not one more, because
you have not entered any more characters yet.

I believe a mailing list (python-list@python.org, or the newsgroup
gmane.comp.python.general on news://news.gmane.org) would be more
appropriate for such questions than the bug tracker.
msg162913 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-06-15 15:46
> >  so the readlines function should always return at least as many bytes as its first parameter. Is this assumption wrong?
> 
> qwert
> 'qwert\n'
> 
> You type five characters "qwert" and press <Enter>. Python immediately
> receives these six characters, and returns a result of
> sys.stdin.readline(1000).

Well, did you try readline() or readlines()?
msg162916 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-15 16:29
> But sys.stdin does not implement RawIOBase, it implements TextIOBase.

sys.stdin.buffer.raw implements RawIOBase.
msg162917 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-15 16:36
> > 
> > qwert
> > 'qwert\n'

Oh, it seems that the mail server again ate some lines of my examples.

> Well, did you try readline() or readlines()?

Yes, it's my mistake, I used readline().
msg162920 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-06-15 16:38
> Oh, it seems that the mail server again ate some lines of my examples.

This is a bug in the e-mail gateway. You can lobby for a fix at
http://psf.upfronthosting.co.za/roundup/meta/issue264
History
Date User Action Args
2012-06-19 09:14:15 kristjan.jonsson set messages: - msg163150
2012-06-19 09:13:32 kristjan.jonsson set nosy: + kristjan.jonsson; messages: + msg163150
2012-06-16 09:08:48 flox set nosy: + flox
2012-06-15 16:38:32 pitrou set messages: + msg162920
2012-06-15 16:36:13 serhiy.storchaka set messages: + msg162917
2012-06-15 16:29:43 serhiy.storchaka set messages: + msg162916
2012-06-15 15:46:51 pitrou set messages: + msg162913
2012-06-15 15:44:55 serhiy.storchaka set messages: + msg162912
2012-06-15 15:40:02 pitrou set messages: + msg162911
2012-06-15 15:32:43 r.david.murray set messages: + msg162910
2012-06-15 15:29:24 r.david.murray set messages: + msg162909
2012-06-15 15:23:23 r.david.murray set messages: + msg162908
2012-06-15 15:04:19 jgeralnik set messages: + msg162907
2012-06-15 14:59:06 jgeralnik set messages: + msg162906
2012-06-15 14:41:19 serhiy.storchaka set messages: + msg162905
2012-06-15 14:20:34 jgeralnik set files: + fileinput.patch; nosy: + jgeralnik; messages: + msg162903; keywords: + patch
2012-06-14 20:58:57 Arfrever set nosy: + Arfrever
2012-06-14 20:18:17 pitrou set nosy: + pitrou; messages: + msg162821
2012-06-14 20:02:44 serhiy.storchaka set messages: + msg162820
2012-06-14 19:35:37 r.david.murray set nosy: + docs@python; messages: + msg162817; assignee: docs@python; components: + Documentation
2012-06-14 19:09:55 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg162815
2012-06-14 17:32:57 r.david.murray set type: behavior; messages: + msg162809; stage: needs patch
2012-06-14 17:14:49 zach.ware set nosy: + zach.ware; messages: + msg162808; versions: + Python 3.2, Python 3.3
2012-06-14 16:10:47 jason.coombs set messages: + msg162803
2012-06-14 16:07:33 r.david.murray set nosy: + r.david.murray; messages: + msg162802
2012-06-14 15:23:17 jason.coombs set messages: + msg162799
2012-06-14 15:19:34 jason.coombs create