fileinput requires two EOF when reading stdin #59273

Closed
jaraco opened this issue Jun 14, 2012 · 32 comments
Labels
docs Documentation in the Doc dir stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@jaraco
Member

jaraco commented Jun 14, 2012

BPO 15068
Nosy @gvanrossum, @jaraco, @pitrou, @benjaminp, @bitdancer, @florentx, @vadmium, @zware, @serhiy-storchaka
Files
  • fileinput.patch
  • fileinput_no_buffer.patch
  • fileinput_no_buffer-2.7.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2016-03-08.16:45:33.888>
    created_at = <Date 2012-06-14.15:19:34.375>
    labels = ['type-bug', 'library', 'docs']
    title = 'fileinput requires two EOF when reading stdin'
    updated_at = <Date 2016-03-08.21:37:04.705>
    user = 'https://github.com/jaraco'

    bugs.python.org fields:

    activity = <Date 2016-03-08.21:37:04.705>
    actor = 'python-dev'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2016-03-08.16:45:33.888>
    closer = 'serhiy.storchaka'
    components = ['Documentation', 'Library (Lib)']
    creation = <Date 2012-06-14.15:19:34.375>
    creator = 'jaraco'
    dependencies = []
    files = ['26018', '41225', '41226']
    hgrepos = []
    issue_num = 15068
    keywords = ['patch']
    message_count = 32.0
    messages = ['162798', '162799', '162802', '162803', '162808', '162809', '162815', '162817', '162820', '162821', '162903', '162905', '162906', '162907', '162908', '162909', '162910', '162911', '162912', '162913', '162916', '162917', '162920', '255813', '256721', '256893', '256905', '260106', '261366', '261378', '261381', '261382']
    nosy_count = 13.0
    nosy_names = ['gvanrossum', 'jaraco', 'pitrou', 'benjamin.peterson', 'Arfrever', 'r.david.murray', 'flox', 'docs@python', 'python-dev', 'martin.panter', 'zach.ware', 'serhiy.storchaka', 'jgeralnik']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue15068'
    versions = ['Python 2.7', 'Python 3.5', 'Python 3.6']

    @jaraco
    Member Author

    jaraco commented Jun 14, 2012

    I found that fileinput.input() requires two EOF characters to stop reading input on Python 2.7.3 on Windows and Ubuntu:

    PS C:\Users\jaraco> python
    Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
    >>> import fileinput
    >>> lines = list(fileinput.input())
    foo
    bar
    ^Z
    ^Z
    >>> lines
    ['foo\n', 'bar\n']

    I don't see anything in the documentation that suggests that two EOF characters would be required, and I can't think of any reason why that should be the case.

    @jaraco jaraco added the stdlib Python modules in the Lib dir label Jun 14, 2012
    @jaraco
    Member Author

    jaraco commented Jun 14, 2012

    I observed that if I send EOF as the first character, it is honored immediately and a second EOF is not required.

    @bitdancer
    Member

    Frankly I'm surprised it works at all, since fileinput.input() will by default read from stdin, and stdin is in turn being read by the python prompt.

    I just checked 2.5 on linux, and the same situation exists there (two ^Ds are required to end the input()). I suspect we'll find the explanation in the interaction between the default behavior of fileinput.input() and the interactive prompt.

    @jaraco
    Member Author

    jaraco commented Jun 14, 2012

    FWIW, I encountered the double-EOF behavior when invoking fileinput.input from a script running non-interactively (except of course for the input() call).

    @zware
    Member

    zware commented Jun 14, 2012

    I just tested on Python 3.2 and found something interesting: it seems a ^Z character on a line that has other input is read in as an ordinary character. Also, other input after an EOF on its own line means you still have to send two more EOFs to end.

    Python 3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import fileinput
    >>> lines = list(fileinput.input())
    test
    testing
    ^Z
    ^Z
    >>> lines
    ['test\n', 'testing\n']
    >>> lines = list(fileinput.input())
    test
    testing^Z
    ^Z
    ^Z
    >>> lines
    ['test\n', 'testing\x1a\n']
    >>> lines = list(fileinput.input())
    testing^Z
    test
    ^Z
    testing
    ^Z
    ^Z
    >>> lines
    ['testing\x1a\n', 'test\n', 'testing\n']

    Also, the documentation for fileinput doesn't mention EOF at all.

    @bitdancer
    Member

    I don't know how the EOF character works, but I wouldn't be surprised if it had to be on a line by itself to mean EOF.

    If the double EOF is required when not at the interactive prompt, then there could be a long-standing bug in fileinput's logic where it does another read after the last file is closed. Normally this wouldn't even be visible, since it would just get EOF again, but when the file is an interactive stdin, closing it doesn't really close it...

    @bitdancer bitdancer added the type-bug An unexpected behavior, bug, or error label Jun 14, 2012
    @serhiy-storchaka
    Member

    It is not only the fileinput. The same effect can be achieved by simple idiomatic code:

    import sys
    while True:
        chunk = sys.stdin.read(1000)
        if not chunk:
            break
        # process

    @bitdancer
    Member

    That makes sense. It is a consequence of (a) buffered input and (b) the fact that EOF on stdin doesn't really close it. (And by interactive here I don't just mean Python's interactive prompt, but also the shell).

    By default fileinput uses readlines with a buffer size, so it suffers from the same issue. It is only the second time that you close stdin that it gets an empty buffer, and so terminates.

    Anyone want to try to come up with a doc footnote to explain this?

    @bitdancer bitdancer added the docs Documentation in the Doc dir label Jun 14, 2012
    @serhiy-storchaka
    Member

    Note that in the rare cases when the input ends exactly on the boundary of the read buffer, just one EOF is sufficient. In particular, for read(1) one EOF is always sufficient, and for read(2) it is sufficient in about half of the cases.
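
The effect of the chunk size can be reproduced without a terminal. The following is a minimal sketch, not code from fileinput or from any patch in this thread: `FakeTerminal` and `eofs_needed` are hypothetical names, and the stream simulates an interactive stdin at the binary buffered layer (`io.BufferedReader`) rather than through `sys.stdin` itself. It assumes that an EOF is a zero-byte read after which the stream stays readable.

```python
import io


class FakeTerminal(io.RawIOBase):
    """Hypothetical raw stream that replays scripted read() results.

    Once the script is exhausted, every read reports EOF (0 bytes), but the
    stream stays open, much like an interactive stdin after ^D or ^Z.
    """

    def __init__(self, events):
        super().__init__()
        self.events = list(events)
        self.eofs_seen = 0

    def readable(self):
        return True

    def readinto(self, buf):
        if not self.events:
            self.eofs_seen += 1  # the "user" had to type one more EOF
            return 0
        data = self.events.pop(0)
        buf[:len(data)] = data
        return len(data)


def eofs_needed(payload, chunk_size):
    """Count the EOFs a read(chunk_size) loop consumes before it stops."""
    raw = FakeTerminal([payload])
    reader = io.BufferedReader(raw)
    while reader.read(chunk_size):
        pass
    return raw.eofs_seen


big_chunks = eofs_needed(b"foo\nbar\n", 1000)  # short final chunk, then empty
single_bytes = eofs_needed(b"ab\n", 1)         # input always ends on a boundary
odd_length = eofs_needed(b"ab\n", 2)           # 3 bytes: last chunk is short
even_length = eofs_needed(b"abc\n", 2)         # 4 bytes: ends on a boundary
```

In this simulation a second EOF is needed exactly when the last non-empty chunk is shorter than the requested size, because the first EOF is consumed while trying to fill that chunk.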

    @pitrou
    Member

    pitrou commented Jun 14, 2012

    It is unlikely to be solvable at the Python level. Witness the raw stream's behaviour (in Python 3):

    >>> sys.stdin.buffer.raw.read(1000)

    If you type a letter followed by ^D (Linux) or ^Z (Windows), this returns immediately:

    >>> sys.stdin.buffer.raw.read(1000)
    x^Db'x'

    But since the result is non-empty, the buffering layer will not detect the EOF and will call read() on the raw stream again (as the 1000 bytes are not satisfied). To signal EOF to the buffered stream, you have to type ^D or ^Z *without preceding it with another character*. Try the following:

    >>> sys.stdin.buffer.read(1000)

    You'll see that as long as you type a letter before ^D or ^Z, the read() will not return (until you type more than 1000 characters, that is):

    • ^D alone: returns!
    • a letter followed by ^D: doesn't return
    • a letter followed by ^D followed by ^D: returns!
    • a letter followed by ^D followed by a letter followed by ^D: doesn't return

    This is all caused by the fact that a C read() on stdin doesn't return until either the end of line or EOF (or the requested bytes number is satisfied). Just experiment with:

    >>> os.read(0, 1000)

    That's why I say this is not solvable at the Python level (except perhaps with bizarre ioctl hackery).
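
The `os.read` experiment can also be tried non-interactively. This is only a sketch under the assumption that a pipe is an acceptable stand-in for the terminal: it shows the same two facts, that a raw read returns as soon as some data is available, and that only a zero-byte result means EOF.

```python
import os

read_fd, write_fd = os.pipe()

# One byte is available, so the read returns immediately with a short
# result instead of waiting for the full 1000 bytes.
os.write(write_fd, b"x")
short_read = os.read(read_fd, 1000)

# Closing the write end is what finally produces a zero-byte read (EOF).
os.close(write_fd)
at_eof = os.read(read_fd, 1000)

os.close(read_fd)
```

Unlike a pipe whose write end has been closed, a terminal keeps delivering data after an EOF, which is why the buffered layer can end up asking for a second one.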

    @jgeralnik
    Mannequin

    jgeralnik mannequin commented Jun 15, 2012

    First off, I'm a complete noob looking at the python source code for the first time so forgive me if I've done something wrong.

    What if the length of the chunk is checked as well? The following code works fine:

    import sys
    while True:
        chunk = sys.stdin.read(1000)
        if not chunk:
            break
        # process
        if len(chunk) < 1000:
            break

    Something similar could be done in the fileinput class. The patch I've attached checks if the number of bytes read from the file is less than the size of the buffer (which means that the file has ended). If so, the next time the file is to be read it skips to the next file instead.

    joey@j-Laptop:~/cpython$ ./python 
    Python 3.3.0a3+ (default:befd56673c80+, Jun 15 2012, 17:14:12) 
    [GCC 4.6.3] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import fileinput
    [73732 refs]
    >>> lines = list(fileinput.input())
    foo
    bar
    ^D
    [73774 refs]
    >>> lines
    ['foo\n', 'bar\n']
    [73780 refs]

    @serhiy-storchaka
    Member

    > The patch I've attached checks if the number of bytes read from the file is less than the size of the buffer (which means that the file has ended).

    From io.RawIOBase.read docs:

    """
    Read up to n bytes from the object and return them. As a convenience, if
    n is unspecified or -1, readall() is called. Otherwise, only one system
    call is ever made. Fewer than n bytes may be returned if the operating
    system call returns fewer than n bytes.

    If 0 bytes are returned, and n was not 0, this indicates end of file.
    """

    This is not an arbitrary assumption. In particular, when reading from a
    terminal with line buffering (you can edit the line until you press
    Enter), at the C level you read only one whole line at a time (if the
    line is not longer than the buffer), and you receive 0 bytes only by
    pressing ^D or ^Z at the beginning of a line. The same holds for pipes
    and sockets. At the Python level there are many third-party
    implementations of file-like objects that rely on this behavior, and you
    cannot rewrite all of them.
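
To make the objection concrete, here is a small sketch of how a "short read means end of file" check can lose data on a pipe. `read_until_short` is a hypothetical helper written for this illustration; it mirrors the loop proposed above but is not the attached patch.

```python
import os


def read_until_short(fd, size=1000):
    """Read chunks until EOF or until a chunk is shorter than `size`."""
    chunks = []
    while True:
        chunk = os.read(fd, size)
        if not chunk:
            break  # real EOF: zero bytes returned
        chunks.append(chunk)
        if len(chunk) < size:
            break  # heuristic: treat a short read as EOF
    return b"".join(chunks)


read_fd, write_fd = os.pipe()

# The writer has produced only part of its output so far.
os.write(write_fd, b"first\n")
collected = read_until_short(read_fd)

# The writer continues after the reader has already given up.
os.write(write_fd, b"second\n")
os.close(write_fd)
left_behind = os.read(read_fd, 1000)
os.close(read_fd)
```

The reader stops after `b"first\n"` although the writer was not finished, so the heuristic is only safe for sources that never return short reads before EOF.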

    @jgeralnik
    Mannequin

    jgeralnik mannequin commented Jun 15, 2012

    But this is calling the readlines function, which continually reads from the file until more bytes have been read than the specified argument.

    From bz2.readlines:
    "size can be specified to control the number of lines read: no further lines will be read once the total size of the lines read so far equals or exceeds size."

    Do other file-like objects interpret this parameter differently?

    @jgeralnik
    Mannequin

    jgeralnik mannequin commented Jun 15, 2012

    Forget other filelike objects. The FileInput class only works with actual files, so the readlines function should always return at least as many bytes as its first parameter. Is this assumption wrong?

    @bitdancer
    Member

    fileinput should work (for some definition of work) for anything that can be opened by name using the open syscall on unix. That includes many more things than files. Named pipes are a particularly interesting example in this context.

    @bitdancer
    Member

    So the real question is: does readlines block until the byte count is satisfied? It might, but the docs for io.IOBase.readlines leave open the possibility that fewer lines will be read, and do not limit that to the EOF case. It's not clear, however, whether that is because the non-EOF short-read case is specifically being allowed for, or because the documenter just didn't consider that case.

    @bitdancer
    Member

    The _pyio.py version of readlines does read until the count is equaled or exceeded. This could, however, be an implementation detail and not part of the spec.

    @pitrou
    Member

    pitrou commented Jun 15, 2012

    Le vendredi 15 juin 2012 à 14:41 +0000, Serhiy Storchaka a écrit :

    > From io.RawIOBase.read docs:
    >
    > """
    > Read up to n bytes from the object and return them. As a convenience, if
    > n is unspecified or -1, readall() is called. Otherwise, only one system
    > call is ever made. Fewer than n bytes may be returned if the operating
    > system call returns fewer than n bytes.

    But sys.stdin does not implement RawIOBase, it implements TextIOBase.

    @serhiy-storchaka
    Member

    > Forget other filelike objects. The FileInput class only works with actual files,

    No. sys.stdin can be reassigned before using FileInput. And FileInput
    has an openhook parameter (for reading compressed files or fetching files
    from the Web, for example).

    > so the readlines function should always return at least as many bytes as its first parameter. Is this assumption wrong?

    qwert
    'qwert\n'

    You type the five characters "qwert" and press <Enter>. Python immediately
    receives these six characters and returns the result of
    sys.stdin.readline(1000): only six characters and not one more,
    because you have not entered any more characters yet.

    I believe that a mailing list (python-list@python.org, or the newsgroup
    gmane.comp.python.general on news://news.gmane.org) would be more
    appropriate for such questions than the bug tracker.

    @pitrou
    Member

    pitrou commented Jun 15, 2012

    > so the readlines function should always return at least as many bytes as its first parameter. Is this assumption wrong?

    > qwert
    > 'qwert\n'
    >
    > You type five characters "qwert" and press <Enter>. Python immediately
    > receives these six characters, and returns a result of
    > sys.stdin.readline(1000).

    Well, did you try readline() or readlines()?

    @serhiy-storchaka
    Member

    > But sys.stdin does not implement RawIOBase, it implements TextIOBase.

    sys.stdin.buffer.raw implements RawIOBase.

    @serhiy-storchaka
    Member

    >
    > qwert
    > 'qwert\n'

    Oh, it seems that the mail server again ate some lines of my examples.

    > Well, did you try readline() or readlines()?

    Yes, it's my mistake, I used readline().

    @pitrou
    Member

    pitrou commented Jun 15, 2012

    > Oh, it seems that the mail server again ate some lines of my examples.

    This is a bug in the e-mail gateway. You can lobby for a fix at
    http://psf.upfronthosting.co.za/roundup/meta/issue264

    @serhiy-storchaka
    Member

    Using readlines() instead of readline() was added in 4dbbf322a9df for performance, but it looks like this is no longer needed. A naive implementation with readline() is about 2 times slower, but with careful optimization we can achieve the same performance (or better).

    Here are results of benchmarks.

    Unpatched:

    $ mkdir testdir
    $ for i in `seq 10`; do for j in `seq 1000`; do echo "$j"; done >"testdir/file$i"; done
    $ ./python -m timeit -s "import fileinput, glob; files = glob.glob('testdir/*')" -- "f = fileinput.input(files)" "while f.readline(): pass"
    10 loops, best of 3: 56.4 msec per loop
    $ ./python -m timeit -s "import fileinput, glob; files = glob.glob('testdir/*')" -- "list(fileinput.input(files))"
    10 loops, best of 3: 68.4 msec per loop

    Patched:

    $ ./python -m timeit -s "import fileinput, glob; files = glob.glob('testdir/*')" -- "f = fileinput.input(files)" "while f.readline(): pass"
    10 loops, best of 3: 47.4 msec per loop
    $ ./python -m timeit -s "import fileinput, glob; files = glob.glob('testdir/*')" -- "list(fileinput.input(files))"
    10 loops, best of 3: 63.1 msec per loop

    The patch also fixes the original issue.

    It also fixes one more issue. Currently lines are buffered, so you need to enter many lines before you get the first one:

    >>> import fileinput
    >>> fi = fileinput.input()
    >>> line = fi.readline()
    qwerty
    asdfgh
    zxcvbn
    ^D
    >>> line
    'qwerty\n'

    With the patch you get each line as soon as it is entered.
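
The difference between the two reading strategies can be sketched on a simulated terminal. This is an illustration only and not the committed patch: `FakeTerminal` and `make_reader` are hypothetical names, the stream delivers one line per raw read as a line-buffered terminal would, and the comparison is made on `io.BufferedReader` rather than on fileinput itself.

```python
import io


class FakeTerminal(io.RawIOBase):
    """Hypothetical raw stream: one scripted line per read, then EOFs."""

    def __init__(self, events):
        super().__init__()
        self.events = list(events)
        self.eofs_seen = 0

    def readable(self):
        return True

    def readinto(self, buf):
        if not self.events:
            self.eofs_seen += 1  # zero bytes: the user typed EOF
            return 0
        data = self.events.pop(0)
        buf[:len(data)] = data
        return len(data)


def make_reader():
    raw = FakeTerminal([b"foo\n", b"bar\n"])
    return raw, io.BufferedReader(raw)


# Strategy 1: batches via readlines(hint), repeated until a batch is empty.
raw, reader = make_reader()
batches = []
while True:
    batch = reader.readlines(1000)
    if not batch:
        break
    batches.append(batch)
eofs_with_readlines = raw.eofs_seen

# Strategy 2: one readline() call per line.
raw, reader = make_reader()
first_line = reader.readline()
pending_after_first = len(raw.events)  # lines the "user" has not typed yet
lines = [first_line] + list(iter(reader.readline, b""))
eofs_with_readline = raw.eofs_seen
```

With batches, the first EOF only ends the current batch and a second one is needed to get the empty batch that stops the loop. With `readline()`, the first line is returned before the second has been read, and the first EOF ends the iteration.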

    @serhiy-storchaka
    Member

    Benjamin, is it good to add PendingDeprecationWarning in 2.7?

    @benjaminp
    Contributor

    That individually is probably okay. It's more a question of whether the
    entire change is appropriate for 2.7.

    Note PendingDeprecationWarning is fairly useless, since it's rarely
    enabled.

    On Sat, Dec 19, 2015, at 00:34, Serhiy Storchaka wrote:

    > Serhiy Storchaka added the comment:
    >
    > Benjamin, is it good to add PendingDeprecationWarning in 2.7?
    >
    > ----------
    > nosy: +benjamin.peterson
    > versions: -Python 3.4
    >
    > Python tracker <report@bugs.python.org>
    > <http://bugs.python.org/issue15068>

    @serhiy-storchaka
    Member

    > It's more a question of whether the entire change is appropriate for 2.7.

    What is your answer? To me there is a bug and we can fix it.

    @serhiy-storchaka
    Member

    Ping.

    @serhiy-storchaka
    Member

    Committed in changesets 5fbd16326353 (2.7), 9ead3a6c5f81 (3.5), and fefedbaac640 (default). Due to SMTP failure there is no Roundup report.

    Warnings are not emitted in maintained releases.

    @vadmium
    Member

    vadmium commented Mar 8, 2016

    It seems this change is causing some (intermittent?) buildbot failures in 2.7:

    http://buildbot.python.org/all/builders/s390x%20RHEL%202.7/builds/273/steps/test/logs/stdio

    ======================================================================
    FAIL: test_saveall (test.test_gc.GCTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/dje/cpython-buildarea/2.7.edelsohn-rhel-z/build/Lib/test/test_gc.py", line 199, in test_saveall
        self.assertEqual(gc.garbage, [])
    AssertionError: Lists differ: [<fileinput.FileInput instance... != []

    First list contains 28 additional elements.
    First extra element 0:
    <fileinput.FileInput instance at 0x3fff6821a68>

    Diff is 1461 characters long. Set self.maxDiff to None to see it.

    ======================================================================
    FAIL: test_create_read (test.test_csv.TestLeaks)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/dje/cpython-buildarea/2.7.edelsohn-rhel-z/build/Lib/test/test_csv.py", line 1103, in test_create_read
        self.assertEqual(gc.garbage, [])
    AssertionError: Lists differ: [<fileinput.FileInput instance... != []

    First list contains 28 additional elements.
    First extra element 0:
    <fileinput.FileInput instance at 0x3fff6821a68>

    Diff is 1461 characters long. Set self.maxDiff to None to see it.

    @serhiy-storchaka
    Member

    Ah, thanks Martin. I forgot that assigning an attribute to a bound method creates a reference loop.

    This can be fixed without a performance loss by using a clever trick.
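
For readers unfamiliar with the problem, the reference loop itself can be sketched as follows. The class names are hypothetical and this is not the fileinput code or the fix that was committed; the result also relies on CPython's reference counting, so other implementations may behave differently.

```python
import gc
import weakref


class StoresBoundMethod:
    def __init__(self):
        # The instance dict now holds a bound method, and the bound method
        # holds the instance: self -> __dict__ -> method -> self.
        self.readline = self._readline_impl

    def _readline_impl(self):
        return ""


class PlainMethods:
    def readline(self):
        return ""


def survives_del(cls):
    """Return True if an instance is still alive right after `del`."""
    obj = cls()
    ref = weakref.ref(obj)
    del obj
    return ref() is not None


gc.disable()  # keep the cycle collector from running mid-experiment
try:
    looped_survives = survives_del(StoresBoundMethod)
    plain_survives = survives_del(PlainMethods)
finally:
    gc.enable()
    gc.collect()  # clean up the instance left behind by the loop
```

An object caught in such a loop is only reclaimed by the cycle collector, which is consistent with the FileInput instances showing up in `gc.garbage` in the test runs quoted above.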

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 8, 2016

    New changeset 88d6742aa99a by Serhiy Storchaka in branch '2.7':
    Issue bpo-15068: Avoid creating a reference loop in fileinput.
    https://hg.python.org/cpython/rev/88d6742aa99a

    New changeset a0de41b46aa6 by Serhiy Storchaka in branch '3.5':
    Issue bpo-15068: Avoid creating a reference loop in fileinput.
    https://hg.python.org/cpython/rev/a0de41b46aa6

    New changeset 27c9849ba5f3 by Serhiy Storchaka in branch 'default':
    Issue bpo-15068: Avoid creating a reference loop in fileinput.
    https://hg.python.org/cpython/rev/27c9849ba5f3

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022