Message 9483 - Python tracker

Author	jvr
Date	2002-03-02.15:44:15

Content

Given a file created with this snippet:

  >>> f = open("tmp.txt", "w")
  >>> for i in range(10000):
  ...     f.write("%s\n" % i)
  ... 
  >>> f.close()

Iterating over a file multiple times has unexpected behavior:

  >>> f = open("tmp.txt")
  >>> for line in f:
  ...     print line.strip()
  ...     break
  ... 
  0
  >>> for line in f:
  ...     print line.strip()
  ...     break
  ... 
  1861
  >>> 

I expected the last output line to be 1 instead of 1861.

While I understand the cause (the file iterator uses xreadlines,
which reads a big chunk ahead and leaves the actual filepos out of
sync), this seems to be an undocumented gotcha. The docs say this:

  [ ... ] Each iteration returns the same result as
  file.readline(), and iteration ends when the readline()
  method returns an empty string.

That is true within one for loop, but not when you break out of the
loop and start another one, which I think is a valid idiom.

Another example of breakage:

  f = open(...)
  for line in f:
      if somecondition(line):
          break
      ...
  
  data = f.read()  # read rest in one slurp

The fundamental problem IMO is that the file iterator stacks
*another* state on top of an already stateful object. In a sense a
file object is already an iterator. The two states get out of sync,
causing confusing semantics, to say the least. The current behavior
exposes an implementation detail that should be hidden.

I understand that speed is a major issue here, so a solution might
not be simple.

Here's a report from an actual user:
http://groups.google.com/groups?hl=en&selm=owen-0B3ECB.10234615022002%40nntp2.u.washington.edu
The rest of the thread suggests possible solutions.

Here's what I *think* should happen (though I hardly know the
innards of either the file object or xreadlines): xreadlines should
be merged with the file object. The buffer that xreadlines
implements should be *the* buffer for the file object, and *all*
read methods should use *that* buffer and the corresponding filepos.

Maybe files should grow a .next() method, so iter(f) can return
f itself. .next() and .readline() are then 100% equivalent.
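
As a rough illustration of the proposal, here is a sketch of a file
wrapper in which every read method shares one buffer and iter(f)
returns f itself. It is written for modern Python 3, where the
.next() method is spelled __next__. BufferedLineFile, its chunk_size
parameter and the _fill() helper are hypothetical names invented for
this example; this is not how the real file object is implemented.

```python
import io


class BufferedLineFile:
    """Hypothetical wrapper: one read-ahead buffer shared by all reads."""

    def __init__(self, raw, chunk_size=8192):
        self._raw = raw                # underlying stream, read only via _fill()/read()
        self._chunk_size = chunk_size  # assumed read-ahead size
        self._buf = ""                 # the single buffer every method consumes

    def _fill(self):
        """Read one more chunk into the buffer; return False at EOF."""
        data = self._raw.read(self._chunk_size)
        self._buf += data
        return bool(data)

    def readline(self):
        newline = self._buf.find("\n")
        while newline < 0 and self._fill():
            newline = self._buf.find("\n")
        end = newline + 1 if newline >= 0 else len(self._buf)
        line, self._buf = self._buf[:end], self._buf[end:]
        return line

    def read(self):
        # Return what is still buffered plus the rest of the stream.
        data, self._buf = self._buf + self._raw.read(), ""
        return data

    def __iter__(self):
        return self  # the file is its own iterator: no second state

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line


f = BufferedLineFile(io.StringIO("".join("%s\n" % i for i in range(10000))))

first_lines = []
for _ in range(2):
    for line in f:
        first_lines.append(line.strip())
        break

rest = f.read()  # read the rest in one slurp

print(first_lines)  # prints ['0', '1']
```

Because the read-ahead lives in the object itself, breaking out of a
loop loses nothing: the second loop resumes at line 1, and read()
returns everything from line 2 onward.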

History
Date	User	Action	Args
2007-08-23 13:59:32	admin	link	issue524804 messages
2007-08-23 13:59:32	admin	create