Issue524804
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2002-03-02 15:44 by jvr, last changed 2022-04-10 16:05 by admin. This issue is now closed.
Messages (9) | |||
---|---|---|---|
msg9483 - (view) | Author: Just van Rossum (jvr) * | Date: 2002-03-02 15:44 | |
Given a file created with this snippet: >>> f = open("tmp.txt", "w") >>> for i in range(10000): ... f.write("%s\n" % i) ... >>> f.close() Iterating over a file multiple times has unexpected behavior: >>> f = open("tmp.txt") >>> for line in f: ... print line.strip() ... break ... 0 >>> for line in f: ... print line.strip() ... break ... 1861 >>> I expected the last output line to be 1 instead of 1861. While I understand the cause (xreadlines being used by the file iterator, it reads a big chunk ahead, causing the actual filepos to be out of sync), this seems to be an undocumented gotcha. The docs say this: [ ... ] Each iteration returns the same result as file.readline(), and iteration ends when the readline() method returns an empty string. That is true within one for loop, but not when you break out of the loop and start another one, which I think is a valid idiom. Another example of breakage: f = open(...) for line in f: if somecondition(line): break ... data = f.read() # read rest in one slurp The fundamental problem IMO is that the file iterator stacks *another* state on top of an already stateful object. In a sense a file object is already an iterator. The two states get out of sync, causing confusing semantics, to say the least. The current behavior exposes an implementation detail that should be hidden. I understand that speed is a major issue here, so a solution might not be simple. Here's a report from an actual user: http://groups.google.com/groups?hl=en&selm= owen- 0B3ECB.10234615022002%40nntp2.u.washingto n.edu The rest of the thread suggests possible solutions. Here's what I *think* should happen (but: I'm hardly aware of both the fileobject and xreadline innards) is this: xreadlines should be merged with the file object. The buffer that xreadlines implements should be *the* buffer for the file object, and *all* read methods should use * that* buffer and the according filepos. Maybe files should grow a .next() method, so iter(f) can return f itself. .next() and .readline() are then 100% equivalent. |
|||
msg9484 - (view) | Author: Jason Orendorff (jorend) | Date: 2002-03-04 06:47 | |
Logged In: YES user_id=18139 Agreed on all points of fact. Also +1 on fixing it by making iter(f).next() and f.readline() equivalent, one way or another. ...The easy way: make f.__iter__() call readline() instead of being so aggressive. (Better than nothing, in my view.) ...The hard way (JvR's proposal): add a level of input buffering on top of what the C runtime provides. xreadlines() breaks user expectations precisely because it does this *halfway*. Doing it right would not be such a maintenance burden, I think. I'm willing to write the patch, although others wiser in the ways of diverse stdio implementations (<wink>) might want to supervise. ...As it stands, iter(f) seems like a broken optimization. Which is to say: it's not "undocumented surprising behavior"; it's a bug. |
|||
msg9485 - (view) | Author: Jeremy Hylton (jhylton) | Date: 2002-03-08 01:06 | |
Logged In: YES user_id=31392 If I understand the checkin message Guido wrote for 2.113, he didn't intend the current behavior. > file_getiter(): make iter(file) be equivalent to >file.xreadlines(). > This should be faster. > > This means: > > (1) "for line in file:" won't work if the xreadlines module can't be > imported. > > (2) The body of "for line in file:" shouldn't use the file directly; > the effects (e.g. of file.readline(), file.seek() or even > file.tell()) would be undefined because of the buffering that goes > on in the xreadlines module. |
|||
msg9486 - (view) | Author: Tim Peters (tim.peters) * | Date: 2002-03-08 01:16 | |
Logged In: YES user_id=31435 I'm sure Guido was aware of this. Making the simplest-to- spell idiom as fast as possible was a deliberate decision at the time. |
|||
msg9487 - (view) | Author: Just van Rossum (jvr) * | Date: 2002-03-08 09:08 | |
Logged In: YES user_id=92689 At the cost of, what, sensible, predictable semantics? - fast is better than slow - but slow is better than unpredictable Or something... |
|||
msg9488 - (view) | Author: Jason Orendorff (jorend) | Date: 2002-03-08 09:35 | |
Logged In: YES user_id=18139 Tim wrote: "I'm sure Guido was aware of this." ...Wow, really? That kind of puts a damper on my theory. I guess it can't be a bug after all. :) Tim wrote: "Making the simplest-to-spell idiom as fast as possible was a deliberate decision at the time." ...*That* I believe. |
|||
msg9489 - (view) | Author: Matthias Urlichs (smurf) | Date: 2002-03-19 17:37 | |
Logged In: YES user_id=10327 I'm in favor for the "let files be their own iterator and set .next equal to .readline" solution. The above example can _really_ bite you. The current xreadlinesmodule.c could be converted to a standalone module, if it really is necessary to optimize. The trivial solution for this problem is to change CHUNKSIZE (in Modules/xreadlinesmodule) to 1. Or, even better, to convert it into an instance variable, so you can do this: f=open(...) fi=f.iter(chunk=2000) for line in fi: ... if you want speed, or just write for line in f: (which internally converts to f.iter(chunk=1)) if you want safety. I'm not too firm with Python C interfacing, otherwise I'd write a patch... any takers? |
|||
msg9490 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2002-03-20 03:43 | |
Logged In: YES user_id=6380 There are two forces at work here. You want the most common case (a single "for line in file" that consumes the whole file) to run blindingly fast. And you want full generality, basically equating next() with readline() (except raising StopIteration on EOF). Unfortunately, the only way to go blindingly fast is to do aggressive buffering, and that's what xreadlines does. Until we rewrite the entire I/O system so we have total control over buffering ourselves, it's not easy to mix xreadlines with other operations (readline, seek, tell). We could make the default file iterator use readline, but then it would become much slower, and we'd have to teach people about xreadlines if they want speed. Or we could use the current solution, where speed is the default, and you have to be more careful when you use an unusual coding style, like breaking out of the for loop and continuing in a second for loop. I'm not sure which requirement is more common (speed, or generality), but since looping over all the lines of a (big) file is such a common pattern, I would bet the speed is the more important of the two. In the past we've had a lot of flak about the slowness of the general solution of looping over all lines in a file; the xreadlines-based iterator is much faster, and I am reluctant to change this back in Python 2.3; I'd rather document it carefully (after all, "for line in lines" is a new construct in Python 2.2, and people have to be told about it; we might as well tell them about the limitations and how to work around them). |
|||
msg9491 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2002-03-23 03:40 | |
Logged In: YES user_id=6380 Closing as won't fix. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-10 16:05:03 | admin | set | github: 36189 |
2002-03-02 15:44:15 | jvr | create |