classification
Title: codecs.StreamWriter.writelines problem when passed generator
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, dlesco, facundobatista, lemburg, pitrou, terry.reedy
Priority: normal Keywords:

Created on 2009-03-08 19:31 by dlesco, last changed 2013-03-25 23:08 by benjamin.peterson. This issue is now closed.

Messages (10)
msg83322 - (view) Author: Daniel Lescohier (dlesco) Date: 2009-03-08 19:31
This is the implementation of codecs.Streamwriter.writelines for all 
Python versions that I've checked:

    def writelines(self, list):

        """ Writes the concatenated list of strings to the stream
            using .write().
        """
        self.write(''.join(list))

This may be a problem if the 'list' parameter is a generator. The 
generator may be returning millions of values, which the join will 
concatenate in memory. It can surprise the programmer with large 
memory use. I think join should only be used when you know the size of 
your input, and this method does not know this. I think the safe 
implementation of this method would be:

    def writelines(self, list):

        """ Writes the concatenated list of strings to the stream
            using .write().
        """
        write = self.write
        for value in list:
            write(value)

If a caller knows that it's input list would use a reasonable amount 
of memory, it can get the same functionality as before by doing 
stream.write(''.join(list)).
msg83408 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2009-03-09 21:54
When fixing this, note that the builtin name "list" should not be
overwritten by the argument name.
msg83419 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-03-10 07:33
For the common case where list is in fact a sequence of strings, the
used implementation is a lot faster and more efficient than the one you
propose.

Note that the method doesn't pretend to support generators for the list
argument, so adding support for iterators would be a feature request,
not a bug report. 

Furthermore, the StreamWriter method can easily be overridden by the
codec's own implementation (the version in codecs.py is the base method
defining the interface and providing a default implementation), so
adding support to the base method will not necessarily mean that all
codecs then support iterators as parameter to .writelines().

IMHO, it's better to use the .write() method in a for-loop of your
application if you want to support iterators that generate a lot of output.
msg83435 - (view) Author: Daniel Lescohier (dlesco) Date: 2009-03-10 14:48
In Python's file protocol, readlines and writelines is a protocol for 
iterating over a file. In Python's file protocol, if one doesn't want 
to iterate over the file, one calls read() with no argument in order 
to read the whole file in, or one calls write() with the complete 
contents you want to write.

If writelines is using join, then if one passes an iterator as the 
parameter to writelines, it will not iteratively write to the file, it 
will accumulate everything in memory until the iterator raises 
StopIteration, and then write to the file.  So, if one is tailing the 
output file, one is not going to see anything in the file until the 
end, instead of iteratively seeing content.  So, it's breaking the 
promise of the file protocol's writelines meaning iteratively write.

I think following the protocol is more important than performance. If 
the application is having performance problems, it's up to the 
application to buffer the data in memory and make a single write call.

However, here is an alternative implementation that is slightly more 
complicated, but possibly has better performance for the passed-a-list 
case.  It covers three cases:

1. Passed an empty sequence; do not call self.write at all.
2. Passed a sequence with a length. That implies that all the data is 
available immediately, so one can concantenate and write with one 
self.write call.
3. Passed a sequence with no length.  That implies that all the data 
is not available immediately, so iteratively write it.

    def writelines(self, sequence):

        """ Writes the sequence of strings to the stream
            using .write().
        """
        try:
            sequence_len = len(sequence)
        except TypeError:
            write = self.write
            for value in sequence:
                write(value)
            return
        if sequence_len:
            self.write(''.join(sequence))

I'm not sure which is better.  But one last point is that Python is 
moving more in the direction of using iterators; e.g., in Py3K, 
replacing dict's keys, values, and items with the implementation of 
iterkeys, itervalues, and iteritems.
msg83436 - (view) Author: Daniel Lescohier (dlesco) Date: 2009-03-10 15:36
Let me give an example of why it's important that writelines 
iteratively writes.  For:

rows = (line[:-1].split('\t') for line in in_file)
projected = (keep_fields(row, 0, 3, 7) for row in rows)
filtered = (row for row in projected if row[2]=='1')
out_file.writelines('\t'.join(row)+'\n' for row in filtered)

For a large input file, for a regular out_file object, this will work. 
For a codecs.StreamWriter wrapped out_file object, this won't work, 
because it's not following the file protocol that writelines should 
iteratively write.
msg83437 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-03-10 15:52
On 2009-03-10 16:36, Daniel Lescohier wrote:
> Daniel Lescohier <daniel.lescohier@cbs.com> added the comment:
> 
> Let me give an example of why it's important that writelines 
> iteratively writes.  For:
> 
> rows = (line[:-1].split('\t') for line in in_file)
> projected = (keep_fields(row, 0, 3, 7) for row in rows)
> filtered = (row for row in projected if row[2]=='1')
> out_file.writelines('\t'.join(row)+'\n' for row in filtered)
> 
> For a large input file, for a regular out_file object, this will work. 
> For a codecs.StreamWriter wrapped out_file object, this won't work, 
> because it's not following the file protocol that writelines should 
> iteratively write.

Of course, it's possible to have a generator producing lots of data,
but that does not warrant making most uses of that method slow.

If you'd like to see such support in .writelines(), please provide
an implementation that follows the approach of the file object
implementation in Python 2.x: it writes the lines in chunks of
1000 lines each, if it finds that the input object is not a sequence
(actually, it's stricter than that for some: it requires a Python
list).

The standard case of passing a list of strings to that method
should not get slower because of this.

BTW: I am not aware of any .writelines() file protocol. If there
were such a protocol, .readlines() would also have to return
an iterator (which it doesn't). The idea behind .writelines()
is to be able to write back a list generated with .readlines().

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
msg83441 - (view) Author: Daniel Lescohier (dlesco) Date: 2009-03-10 17:35
OK, I think I see where I went wrong in my perceptions of the file 
protocol.  I thought that readlines() returned an iterator, not a 
list, but I see in the library reference manual on File Objects that 
it returns a list.  I think I got confused because there is no 
equivalent of __iter__ for writing to streams.  For input, I'm always 
using 'for line in file_object' (in other words, 
file_object.__iter__), so I had assumed that writelines was the mirror 
image of that, because I never use the readlines method.  Then, in my 
mind, readlines became the mirror image of writelines, which I had 
assumed took an iterator, so I assumed that readlines returned an 
iterator.  I wonder if this perception problem is common or not.

So, the StreamWriter interface matches the file protocol; readlines() 
and writelines() deal with lists.  There shouldn't be any change to 
it, because it follows the protocol.

Then, the example I wrote would be instead:

rows = (line[:-1].split('\t') for line in in_file)
projected = (keep_fields(row, 0, 3, 7) for row in rows)
filtered = (row for row in projected if row[2]=='1')
formatted = (u'\t'.join(row)+'\n' for row in filtered)
write = out_file.write
for line in formatted:
    write(line)

I think it's correct that the file object write C code only does 
1000-line chunks for sequences that have a defined length: if it has a 
defined length, then that implies that the data exists now, and can be 
concatenated and written now.  Something without a defined length may 
be a generator with items arriving later.
msg185242 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-03-25 22:11
I find this to be a confusing issue.

1. The arg for .join is any iterable of strings, not just a list. (I think Daniel got that.) Since codecs.py still has "self.write(''.join(list))":
  a) calling the arg 'list' is doubly wrong.
  b) there is no iterable feature to add, at least not to the base class.

2. The codecs.writelines entry says "Writes the concatenated list of strings to the stream (possibly by reusing the write() method)."  For the base class, that is overly restrictive, but I gather that Marc-Andre does not want to require that all subclasses be as relaxed. I presume there must be a reason. That means that the feature request of doing so is rejected by him. In that respect, this should be closed unless other developers strongly disagree.

3. The iobase.writelines entry says "Write a list of lines to the stream." For StreamIO, iter(somelist) works just as well. If that applies to .writelines in general, the doc should say 'iterable of lines'. (Antoine, Benjamin?)

If io.writelines does the same (concatenate all together), it should say so, but one comment suggests it does not 

4. If an iterable produces millions of lines, it does not matter if it is a list, generator, or any other iterable. So the title does not point to the real problem, if there is one.
msg185244 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-03-25 22:42
On 25.03.2013 23:11, Terry J. Reedy wrote:
> 2. The codecs.writelines entry says "Writes the concatenated list of strings to the stream (possibly by reusing the write() method)."  For the base class, that is overly restrictive, but I gather that Marc-Andre does not want to require that all subclasses be as relaxed. I presume there must be a reason. That means that the feature request of doing so is rejected by him. In that respect, this should be closed unless other developers strongly disagree.

The doc string of the base class merely describes what the
.writelines() method will be doing. It's not meant as a requirement.

In one of the previous messages, I've already outlined a workable
approach to solve the case that the original request was addressing.
There isn't much point in making the common case slower just
to address a theoretical corner case.

Since no patch has since been submitted, I'd suggest to close the
ticket as won't fix or out of date.
msg185245 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-03-25 23:08
Agreed. Modern consumers can use the io module.
History
Date User Action Args
2013-03-25 23:08:22benjamin.petersonsetstatus: open -> closed
resolution: wont fix
messages: + msg185245
2013-03-25 22:42:03lemburgsetmessages: + msg185244
2013-03-25 22:11:27terry.reedysetnosy: + terry.reedy, pitrou, benjamin.peterson

messages: + msg185242
stage: patch review
2010-07-09 02:58:03terry.reedysetversions: + Python 3.2, - Python 2.6, Python 2.5, Python 3.0, Python 2.7
2009-03-10 17:36:01dlescosetmessages: + msg83441
2009-03-10 15:52:54lemburgsetmessages: + msg83437
2009-03-10 15:36:40dlescosetmessages: + msg83436
2009-03-10 14:48:51dlescosetmessages: + msg83435
2009-03-10 07:33:35lemburgsetnosy: + lemburg
type: enhancement
messages: + msg83419
2009-03-09 21:54:07facundobatistasetnosy: + facundobatista
messages: + msg83408
2009-03-08 19:31:15dlescocreate