This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: file iterator "deemed broken"; can resume after StopIteration
Type: behavior Stage:
Components: Documentation Versions: Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: dalke, docs@python, pitrou
Priority: normal Keywords:

Created on 2015-02-12 18:48 by dalke, last changed 2022-04-11 14:58 by admin.

Messages (1)
msg235850 - Author: Andrew Dalke (dalke) * (Python committer) Date: 2015-02-12 18:48
The file iterator is "deemed broken". As I don't think it should be made non-broken, I suggest the documentation should be changed to point out when file iteration is broken. I also think the term 'broken' is a label with needlessly harsh connotations and should be softened.

The iterator documentation uses the term 'broken' like this (quoting here from https://docs.python.org/3.4/library/stdtypes.html):

  Once an iterator’s __next__() method raises StopIteration,
  it must continue to do so on subsequent calls. Implementations
  that do not obey this property are deemed broken.

(Older versions comment "This constraint was added in Python 2.3; in Python 2.2, various iterators are broken according to this rule.")

An IOBase is supposed to support the iterator protocol (says https://docs.python.org/3.4/library/io.html#io.IOBase ). However, it does not fully do so, and the documentation does not say that it is broken in the face of a changing file (e.g., when another process appends to a log file).

  % ./python.exe 
  Python 3.5.0a1+ (default:4883f9046b10, Feb 11 2015, 04:30:46) 
  [GCC 4.8.4] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> f = open("empty")
  >>> next(f)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  StopIteration
  >>>
  >>> ^Z
  Suspended
  % echo "Hello!" >> empty
  % fg
  ./python.exe

  >>> next(f)
  'Hello!\n'
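The same session can be sketched as a single script, without suspending the interpreter. This is only an illustration: the temporary file and the helper name resume_after_stop are assumptions made for the example, and the second open() stands in for the other process that appends to the file.

```python
import os
import tempfile


def resume_after_stop():
    """Return what next() gives before and after the file is appended to."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path) as f:
            # The file is empty, so the iterator raises StopIteration
            # and next() falls back to the default value.
            before = next(f, None)
            # Stand-in for another process appending to the file.
            with open(path, "a") as writer:
                writer.write("Hello!\n")
            # Same file object, asked again after the append.
            after = next(f, None)
        return before, after
    finally:
        os.remove(path)


before, after = resume_after_stop()
print(repr(before), repr(after))
```

On a CPython build that behaves like the one in the session above, before should be None and after should be the appended line.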

This is apparently well-known behavior, as I've come across several references to it on various Python-related lists, including this one from Miles in 2008: https://mail.python.org/pipermail/python-list/2008-September/491920.html .

  Strictly speaking, file objects are broken iterators:

Fredrik Lundh in the same thread ( https://mail.python.org/pipermail/python-list/2008-September/521090.html ) says:

  it's a design guideline, not an absolute rule

The 7+ years of 'broken' behavior in Python suggests that /F is correct. But while 'broken' could be considered a meaningless label, it carries with it some rather negative connotations. It sounds like developers are supposed to make every effort to avoid broken code, when that's not something Python itself does. It also means that my code can be called "broken" solely because it assumed Python file iterators are non-broken. I am not happy when people say my code is broken.

It is entirely reasonable that a seek(0) would reset the state, so that next(it) no longer raises StopIteration. However, errors can arise when a Python file object is used as an iterator to parse a log file, or any other file that another process appends to.
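The seek(0) case can be shown with an in-memory stream. This is a small sketch using io.StringIO as a stand-in for a real file; the variable names are made up for the example.

```python
import io

f = io.StringIO("first\nsecond\n")

lines = list(f)              # consume every line
exhausted = next(f, None)    # the iterator has stopped, so the default is returned

f.seek(0)                    # rewind to the start of the stream
after_seek = next(f, None)   # iteration starts over from the first line

print(lines, exhausted, repr(after_seek))
```

Here the resumption is deliberate, because the caller asked for it with seek(). The appended-file case is different in that nothing in the reading code requests it.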

Here's an example of code that can break. It extracts the first and last elements of an iterator; more specifically, the first and last lines of a file. If there are no lines it returns None for both values; and if there's only one line then it returns the same line as both values.

  def get_first_and_last_elements(it):
      first = last = next(it, None)
      for last in it:
          pass
      return first, last

This code expects a non-broken iterator. If it is passed a file that 1) is empty when next() is called and 2) has been appended to by the time Python reaches the for loop, then it's possible for the first value to be None while last is a string.

This is unexpected, undocumented, and may lead to subtle errors.
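The timing can be reproduced deterministically with a stand-in for a growing file. In this sketch the ResumableLines class is hypothetical, written only to mimic an iterator that resumes after StopIteration, and the two steps of the function are written out inline so that the append can happen between them.

```python
class ResumableLines:
    """Hypothetical stand-in for a file that another process appends to."""

    def __init__(self):
        self._pending = []

    def append(self, line):
        # Simulates the other process adding a line to the file.
        self._pending.append(line)

    def __iter__(self):
        return self

    def __next__(self):
        if not self._pending:
            # Like a file at EOF: stop for now, but possibly resume later.
            raise StopIteration
        return self._pending.pop(0)


it = ResumableLines()

# Step 1 of the function: the "file" is still empty here.
first = last = next(it, None)

# Another process appends before the loop starts.
it.append("Hello!\n")

# Step 2 of the function: the iterator has resumed.
for last in it:
    pass

print(repr(first), repr(last))
```

The result pairs a missing first line with a real last line, which is the inconsistent state described above.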

There are work-arounds, like ensuring that the StopIteration only occurs once:

  def get_first_and_last_elements(it):
      first = last = next(it, None)
      if last is not None:
          for last in it:
              pass
      return first, last

but much existing code expects non-broken iterators, such as the Python example implementation at https://docs.python.org/2/library/itertools.html#itertools.dropwhile . (I have a reproducible failure using it, a fork(), and a file iterator with a sleep() if that would prove useful.)

Another option is to have a wrapper around file object iterators to keep raising StopIteration, like:

   def safe_iter(it):
       yield from it

   # -or-  (line for line in file_iter)

but people need to know to do this with file iterators or other potentially broken iterators. The current documentation does not say when file iterators are broken, and I don't know which other iterators are also broken.
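As a rough check of the wrapper idea, the sketch below compares the wrapped iterator with the bare file object after an append. The temporary file and the variable names are assumptions for the example, and the second open() again stands in for another process.

```python
import os
import tempfile


def safe_iter(it):
    # A generator that has finished keeps raising StopIteration.
    yield from it


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path) as f:
        wrapped = safe_iter(f)
        first_try = next(wrapped, None)    # empty file: the generator finishes
        # Stand-in for another process appending to the file.
        with open(path, "a") as writer:
            writer.write("Hello!\n")
        second_try = next(wrapped, None)   # the finished generator stays finished
        raw_try = next(f, None)            # the bare file iterator, asked directly
finally:
    os.remove(path)

print(repr(first_try), repr(second_try), repr(raw_try))
```

The wrapper stays exhausted because generator objects never resume once their body has returned. The trade-off is that lines appended later are never seen through the wrapper, so it suits code that wants one consistent pass rather than code that follows a growing file.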

I realize this is a tricky issue.

I don't think it's possible now to change the file's StopIteration behavior. I expect that there is code which depends on the current brokenness, the ability to seek() and re-iterate is useful, and the idea that next() returns text if and only if readline() is not empty is useful and well-entrenched. PyPy has the same behavior as CPython, so any change will take some time to propagate to the other implementations.

Instead, I'm fine with a documentation change in io.html . It currently says:

  IOBase (and its subclasses) support the iterator protocol,
  meaning that an IOBase object can be iterated over yielding
  the lines in a stream. Lines are defined slightly differently
  depending on whether the stream is a binary stream (yielding
  bytes), or a text stream (yielding unicode strings). See
  readline() below.

I suggest adding something like:

  The file iterator does not completely follow the iterator protocol.
  If new data is added to the file after the iterator raises
  a StopIteration then next(file) will resume returning lines.
  The safest way to iterate over lines in a log file or other
  changing file is to use a generator expression:

     (line for line in file)

  The iterator may also resume after using seek() to move
  the file position.

You'll note that I failed to use the term "broken". This should really start:

   The file iterator is broken.

I find that term rather harsh, and since broken iterators are acceptable in Python, I suggest toning down or qualifying the use of "broken" in stdtypes.html. I have no suggestions for an improved version.
History
Date                 User          Action  Args
2022-04-11 14:58:12  admin         set     github: 67643
2015-07-21 07:29:07  ethan.furman  set     nosy: - ethan.furman
2015-03-02 07:43:48  ezio.melotti  set     nosy: + pitrou; type: behavior
2015-02-12 18:55:28  ethan.furman  set     nosy: + ethan.furman
2015-02-12 18:48:54  dalke         create