Created on 2009-03-08 19:31 by dlesco, last changed 2013-03-25 23:08 by benjamin.peterson. This issue is now closed.
|msg83322 - (view)||Author: Daniel Lescohier (dlesco)||Date: 2009-03-08 19:31|
This is the implementation of codecs.Streamwriter.writelines for all Python versions that I've checked: def writelines(self, list): """ Writes the concatenated list of strings to the stream using .write(). """ self.write(''.join(list)) This may be a problem if the 'list' parameter is a generator. The generator may be returning millions of values, which the join will concatenate in memory. It can surprise the programmer with large memory use. I think join should only be used when you know the size of your input, and this method does not know this. I think the safe implementation of this method would be: def writelines(self, list): """ Writes the concatenated list of strings to the stream using .write(). """ write = self.write for value in list: write(value) If a caller knows that it's input list would use a reasonable amount of memory, it can get the same functionality as before by doing stream.write(''.join(list)).
|msg83408 - (view)||Author: Facundo Batista (facundobatista) *||Date: 2009-03-09 21:54|
When fixing this, note that the builtin name "list" should not be overwritten by the argument name.
|msg83419 - (view)||Author: Marc-Andre Lemburg (lemburg) *||Date: 2009-03-10 07:33|
For the common case where list is in fact a sequence of strings, the used implementation is a lot faster and more efficient than the one you propose. Note that the method doesn't pretend to support generators for the list argument, so adding support for iterators would be a feature request, not a bug report. Furthermore, the StreamWriter method can easily be overridden by the codec's own implementation (the version in codecs.py is the base method defining the interface and providing a default implementation), so adding support to the base method will not necessarily mean that all codecs then support iterators as parameter to .writelines(). IMHO, it's better to use the .write() method in a for-loop of your application if you want to support iterators that generate a lot of output.
|msg83435 - (view)||Author: Daniel Lescohier (dlesco)||Date: 2009-03-10 14:48|
In Python's file protocol, readlines and writelines is a protocol for iterating over a file. In Python's file protocol, if one doesn't want to iterate over the file, one calls read() with no argument in order to read the whole file in, or one calls write() with the complete contents you want to write. If writelines is using join, then if one passes an iterator as the parameter to writelines, it will not iteratively write to the file, it will accumulate everything in memory until the iterator raises StopIteration, and then write to the file. So, if one is tailing the output file, one is not going to see anything in the file until the end, instead of iteratively seeing content. So, it's breaking the promise of the file protocol's writelines meaning iteratively write. I think following the protocol is more important than performance. If the application is having performance problems, it's up to the application to buffer the data in memory and make a single write call. However, here is an alternative implementation that is slightly more complicated, but possibly has better performance for the passed-a-list case. It covers three cases: 1. Passed an empty sequence; do not call self.write at all. 2. Passed a sequence with a length. That implies that all the data is available immediately, so one can concantenate and write with one self.write call. 3. Passed a sequence with no length. That implies that all the data is not available immediately, so iteratively write it. def writelines(self, sequence): """ Writes the sequence of strings to the stream using .write(). """ try: sequence_len = len(sequence) except TypeError: write = self.write for value in sequence: write(value) return if sequence_len: self.write(''.join(sequence)) I'm not sure which is better. But one last point is that Python is moving more in the direction of using iterators; e.g., in Py3K, replacing dict's keys, values, and items with the implementation of iterkeys, itervalues, and iteritems.
|msg83436 - (view)||Author: Daniel Lescohier (dlesco)||Date: 2009-03-10 15:36|
Let me give an example of why it's important that writelines iteratively writes. For: rows = (line[:-1].split('\t') for line in in_file) projected = (keep_fields(row, 0, 3, 7) for row in rows) filtered = (row for row in projected if row=='1') out_file.writelines('\t'.join(row)+'\n' for row in filtered) For a large input file, for a regular out_file object, this will work. For a codecs.StreamWriter wrapped out_file object, this won't work, because it's not following the file protocol that writelines should iteratively write.
|msg83437 - (view)||Author: Marc-Andre Lemburg (lemburg) *||Date: 2009-03-10 15:52|
On 2009-03-10 16:36, Daniel Lescohier wrote: > Daniel Lescohier <firstname.lastname@example.org> added the comment: > > Let me give an example of why it's important that writelines > iteratively writes. For: > > rows = (line[:-1].split('\t') for line in in_file) > projected = (keep_fields(row, 0, 3, 7) for row in rows) > filtered = (row for row in projected if row=='1') > out_file.writelines('\t'.join(row)+'\n' for row in filtered) > > For a large input file, for a regular out_file object, this will work. > For a codecs.StreamWriter wrapped out_file object, this won't work, > because it's not following the file protocol that writelines should > iteratively write. Of course, it's possible to have a generator producing lots of data, but that does not warrant making most uses of that method slow. If you'd like to see such support in .writelines(), please provide an implementation that follows the approach of the file object implementation in Python 2.x: it writes the lines in chunks of 1000 lines each, if it finds that the input object is not a sequence (actually, it's stricter than that for some: it requires a Python list). The standard case of passing a list of strings to that method should not get slower because of this. BTW: I am not aware of any .writelines() file protocol. If there were such a protocol, .readlines() would also have to return an iterator (which it doesn't). The idea behind .writelines() is to be able to write back a list generated with .readlines(). Thanks, -- Marc-Andre Lemburg eGenix.com ________________________________________________________________________ ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
|msg83441 - (view)||Author: Daniel Lescohier (dlesco)||Date: 2009-03-10 17:35|
OK, I think I see where I went wrong in my perceptions of the file protocol. I thought that readlines() returned an iterator, not a list, but I see in the library reference manual on File Objects that it returns a list. I think I got confused because there is no equivalent of __iter__ for writing to streams. For input, I'm always using 'for line in file_object' (in other words, file_object.__iter__), so I had assumed that writelines was the mirror image of that, because I never use the readlines method. Then, in my mind, readlines became the mirror image of writelines, which I had assumed took an iterator, so I assumed that readlines returned an iterator. I wonder if this perception problem is common or not. So, the StreamWriter interface matches the file protocol; readlines() and writelines() deal with lists. There shouldn't be any change to it, because it follows the protocol. Then, the example I wrote would be instead: rows = (line[:-1].split('\t') for line in in_file) projected = (keep_fields(row, 0, 3, 7) for row in rows) filtered = (row for row in projected if row=='1') formatted = (u'\t'.join(row)+'\n' for row in filtered) write = out_file.write for line in formatted: write(line) I think it's correct that the file object write C code only does 1000-line chunks for sequences that have a defined length: if it has a defined length, then that implies that the data exists now, and can be concatenated and written now. Something without a defined length may be a generator with items arriving later.
|msg185242 - (view)||Author: Terry J. Reedy (terry.reedy) *||Date: 2013-03-25 22:11|
I find this to be a confusing issue. 1. The arg for .join is any iterable of strings, not just a list. (I think Daniel got that.) Since codecs.py still has "self.write(''.join(list))": a) calling the arg 'list' is doubly wrong. b) there is no iterable feature to add, at least not to the base class. 2. The codecs.writelines entry says "Writes the concatenated list of strings to the stream (possibly by reusing the write() method)." For the base class, that is overly restrictive, but I gather that Marc-Andre does not want to require that all subclasses be as relaxed. I presume there must be a reason. That means that the feature request of doing so is rejected by him. In that respect, this should be closed unless other developers strongly disagree. 3. The iobase.writelines entry says "Write a list of lines to the stream." For StreamIO, iter(somelist) works just as well. If that applies to .writelines in general, the doc should say 'iterable of lines'. (Antoine, Benjamin?) If io.writelines does the same (concatenate all together), it should say so, but one comment suggests it does not 4. If an iterable produces millions of lines, it does not matter if it is a list, generator, or any other iterable. So the title does not point to the real problem, if there is one.
|msg185244 - (view)||Author: Marc-Andre Lemburg (lemburg) *||Date: 2013-03-25 22:42|
On 25.03.2013 23:11, Terry J. Reedy wrote: > 2. The codecs.writelines entry says "Writes the concatenated list of strings to the stream (possibly by reusing the write() method)." For the base class, that is overly restrictive, but I gather that Marc-Andre does not want to require that all subclasses be as relaxed. I presume there must be a reason. That means that the feature request of doing so is rejected by him. In that respect, this should be closed unless other developers strongly disagree. The doc string of the base class merely describes what the .writelines() method will be doing. It's not meant as a requirement. In one of the previous messages, I've already outlined a workable approach to solve the case that the original request was addressing. There isn't much point in making the common case slower just to address a theoretical corner case. Since no patch has since been submitted, I'd suggest to close the ticket as won't fix or out of date.
|msg185245 - (view)||Author: Benjamin Peterson (benjamin.peterson) *||Date: 2013-03-25 23:08|
Agreed. Modern consumers can use the io module.
|2013-03-25 23:08:22||benjamin.peterson||set||status: open -> closed|
resolution: wont fix
messages: + msg185245
|2013-03-25 22:42:03||lemburg||set||messages: + msg185244|
+ terry.reedy, pitrou, benjamin.peterson|
messages: + msg185242
stage: patch review
|2010-07-09 02:58:03||terry.reedy||set||versions: + Python 3.2, - Python 2.6, Python 2.5, Python 3.0, Python 2.7|
|2009-03-10 17:36:01||dlesco||set||messages: + msg83441|
|2009-03-10 15:52:54||lemburg||set||messages: + msg83437|
|2009-03-10 15:36:40||dlesco||set||messages: + msg83436|
|2009-03-10 14:48:51||dlesco||set||messages: + msg83435|
messages: + msg83419
messages: + msg83408