codecs.StreamWriter.writelines problem when passed generator #49695

Closed
dlesco mannequin opened this issue Mar 8, 2009 · 10 comments
Labels
stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments


dlesco mannequin commented Mar 8, 2009

BPO 5445
Nosy @malemburg, @terryjreedy, @facundobatista, @pitrou, @benjaminp

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2013-03-25.23:08:22.264>
created_at = <Date 2009-03-08.19:31:15.800>
labels = ['type-feature', 'library']
title = 'codecs.StreamWriter.writelines problem when passed generator'
updated_at = <Date 2013-03-25.23:08:22.263>
user = 'https://bugs.python.org/dlesco'

bugs.python.org fields:

activity = <Date 2013-03-25.23:08:22.263>
actor = 'benjamin.peterson'
assignee = 'none'
closed = True
closed_date = <Date 2013-03-25.23:08:22.264>
closer = 'benjamin.peterson'
components = ['Library (Lib)']
creation = <Date 2009-03-08.19:31:15.800>
creator = 'dlesco'
dependencies = []
files = []
hgrepos = []
issue_num = 5445
keywords = []
message_count = 10.0
messages = ['83322', '83408', '83419', '83435', '83436', '83437', '83441', '185242', '185244', '185245']
nosy_count = 6.0
nosy_names = ['lemburg', 'terry.reedy', 'facundobatista', 'pitrou', 'benjamin.peterson', 'dlesco']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = 'patch review'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue5445'
versions = ['Python 3.2']


dlesco mannequin commented Mar 8, 2009

This is the implementation of codecs.StreamWriter.writelines for all
Python versions that I've checked:

    def writelines(self, list):
        """ Writes the concatenated list of strings to the stream
            using .write().
        """
        self.write(''.join(list))

This may be a problem if the 'list' parameter is a generator. The
generator may yield millions of values, which join will concatenate
in memory. It can surprise the programmer with large memory use. I
think join should only be used when the size of the input is known,
and this method does not know it. I think the safe implementation of
this method would be:

    def writelines(self, list):
        """ Writes the concatenated list of strings to the stream
            using .write().
        """
        write = self.write
        for value in list:
            write(value)

If a caller knows that its input list would use a reasonable amount
of memory, it can get the same functionality as before by calling
stream.write(''.join(list)).
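
(A minimal sketch of the difference, added for illustration and not part of the original report; the file name and generator are assumptions:)

    import codecs

    def lines():
        # stands in for a generator that might yield millions of values
        for i in range(5):
            yield u'line %d\n' % i

    writer = codecs.getwriter('utf-8')(open('out.txt', 'wb'))

    # current behaviour: the whole output is built in memory, then written once
    writer.write(u''.join(lines()))

    # proposed behaviour: each value is written as soon as it is produced
    for value in lines():
        writer.write(value)

    writer.close()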

@dlesco (mannequin) added the stdlib (Python modules in the Lib dir) label Mar 8, 2009
@facundobatista
Member

When fixing this, note that the builtin name "list" should not be
shadowed by the argument name.

@malemburg
Member

For the common case where list is in fact a sequence of strings, the
current implementation is a lot faster and more efficient than the one
you propose.

Note that the method doesn't pretend to support generators for the list
argument, so adding support for iterators would be a feature request,
not a bug report.

Furthermore, the StreamWriter method can easily be overridden by the
codec's own implementation (the version in codecs.py is the base method
defining the interface and providing a default implementation), so
adding support to the base method will not necessarily mean that all
codecs then support iterators as a parameter to .writelines().

IMHO, it's better to use the .write() method in a for-loop of your
application if you want to support iterators that generate a lot of output.
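
(A hedged sketch of the point about overriding, added for illustration and not taken from the comment; the subclass name is made up:)

    import codecs

    # A codec is free to ship its own StreamWriter whose writelines()
    # iterates; the version in codecs.py is only the default implementation.
    class IterUTF8Writer(codecs.getwriter('utf-8')):
        def writelines(self, sequence):
            for value in sequence:
                self.write(value)

    w = IterUTF8Writer(open('out.txt', 'wb'))
    w.writelines(u'line %d\n' % i for i in range(5))
    w.close()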

@malemburg added the type-feature (A feature request or enhancement) label Mar 10, 2009

dlesco mannequin commented Mar 10, 2009

In Python's file protocol, readlines and writelines are part of a
protocol for iterating over a file. In Python's file protocol, if one
doesn't want to iterate over the file, one calls read() with no
argument in order to read the whole file in, or one calls write() with
the complete contents one wants to write.

If writelines uses join, then if one passes an iterator as the
parameter to writelines, it will not iteratively write to the file; it
will accumulate everything in memory until the iterator raises
StopIteration, and then write to the file. So, if one is tailing the
output file, one will not see anything in the file until the end,
instead of seeing content appear iteratively. That breaks the promise
of the file protocol that writelines writes iteratively.

I think following the protocol is more important than performance. If
the application is having performance problems, it's up to the
application to buffer the data in memory and make a single write call.

However, here is an alternative implementation that is slightly more
complicated, but possibly has better performance for the passed-a-list
case. It covers three cases:

  1. Passed an empty sequence; do not call self.write at all.
  2. Passed a sequence with a length. That implies that all the data is
    available immediately, so one can concatenate and write with one
    self.write call.
  3. Passed a sequence with no length. That implies that all the data
    is not available immediately, so iteratively write it.

    def writelines(self, sequence):
        """ Writes the sequence of strings to the stream
            using .write().
        """
        try:
            sequence_len = len(sequence)
        except TypeError:
            write = self.write
            for value in sequence:
                write(value)
            return
        if sequence_len:
            self.write(''.join(sequence))

I'm not sure which is better. But one last point is that Python is
moving more in the direction of using iterators; e.g., in Py3K, dict's
keys, values, and items take over the role of iterkeys, itervalues,
and iteritems.
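
(For context, my illustration rather than part of the comment: the try/except above relies on len() raising TypeError for objects that have no length, such as generators.)

    >>> len(['a', 'b'])
    2
    >>> len(c for c in 'ab')
    Traceback (most recent call last):
      ...
    TypeError: object of type 'generator' has no len()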


dlesco mannequin commented Mar 10, 2009

Let me give an example of why it's important that writelines
iteratively writes. For:

    rows = (line[:-1].split('\t') for line in in_file)
    projected = (keep_fields(row, 0, 3, 7) for row in rows)
    filtered = (row for row in projected if row[2]=='1')
    out_file.writelines('\t'.join(row)+'\n' for row in filtered)

For a large input file, for a regular out_file object, this will work.
For a codecs.StreamWriter wrapped out_file object, this won't work,
because it's not following the file protocol that writelines should
iteratively write.
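
(keep_fields is not defined in the comment; judging from its use above, it presumably keeps only the given column indexes of a row. A hypothetical sketch:)

    def keep_fields(row, *indexes):
        # return just the selected columns, in the order the indexes are given
        return [row[i] for i in indexes]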

@malemburg
Member

On 2009-03-10 16:36, Daniel Lescohier wrote:

> Let me give an example of why it's important that writelines
> iteratively writes. For:
>
>     rows = (line[:-1].split('\t') for line in in_file)
>     projected = (keep_fields(row, 0, 3, 7) for row in rows)
>     filtered = (row for row in projected if row[2]=='1')
>     out_file.writelines('\t'.join(row)+'\n' for row in filtered)
>
> For a large input file, for a regular out_file object, this will work.
> For a codecs.StreamWriter wrapped out_file object, this won't work,
> because it's not following the file protocol that writelines should
> iteratively write.

Of course, it's possible to have a generator producing lots of data,
but that does not warrant making most uses of that method slow.

If you'd like to see such support in .writelines(), please provide
an implementation that follows the approach of the file object
implementation in Python 2.x: it writes the lines in chunks of
1000 lines each, if it finds that the input object is not a sequence
(actually, it's stricter than that for some reason: it requires a Python
list).

The standard case of passing a list of strings to that method
should not get slower because of this.

BTW: I am not aware of any .writelines() file protocol. If there
were such a protocol, .readlines() would also have to return
an iterator (which it doesn't). The idea behind .writelines()
is to be able to write back a list generated with .readlines().

Thanks,

Marc-Andre Lemburg
eGenix.com
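
(A rough sketch of the chunked approach described above, added for illustration; the chunk size and structure are assumptions, not the actual Python 2.x C implementation:)

    def writelines(self, sequence):
        # Plain lists keep the fast join-based path; anything else is
        # written in joined chunks, so a long iterator is never fully
        # buffered in memory.
        if isinstance(sequence, list):
            self.write(''.join(sequence))
            return
        chunk = []
        for value in sequence:
            chunk.append(value)
            if len(chunk) >= 1000:
                self.write(''.join(chunk))
                del chunk[:]
        if chunk:
            self.write(''.join(chunk))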




dlesco mannequin commented Mar 10, 2009

OK, I think I see where I went wrong in my perceptions of the file
protocol. I thought that readlines() returned an iterator, not a
list, but I see in the library reference manual on File Objects that
it returns a list. I think I got confused because there is no
equivalent of __iter__ for writing to streams. For input, I'm always
using 'for line in file_object' (in other words,
file_object.__iter__), so I had assumed that writelines was the mirror
image of that, because I never use the readlines method. Then, in my
mind, readlines became the mirror image of writelines, which I had
assumed took an iterator, so I assumed that readlines returned an
iterator. I wonder if this perception problem is common or not.

So, the StreamWriter interface matches the file protocol; readlines()
and writelines() deal with lists. There shouldn't be any change to
it, because it follows the protocol.

Then, the example I wrote would be instead:

    rows = (line[:-1].split('\t') for line in in_file)
    projected = (keep_fields(row, 0, 3, 7) for row in rows)
    filtered = (row for row in projected if row[2]=='1')
    formatted = (u'\t'.join(row)+'\n' for row in filtered)
    write = out_file.write
    for line in formatted:
        write(line)

I think it's correct that the file object write C code only does
1000-line chunks for sequences that have a defined length: if it has a
defined length, then that implies that the data exists now, and can be
concatenated and written now. Something without a defined length may
be a generator with items arriving later.

@terryjreedy
Member

I find this to be a confusing issue.

  1. The arg for .join is any iterable of strings, not just a list. (I think Daniel got that.) Since codecs.py still has "self.write(''.join(list))":
    a) calling the arg 'list' is doubly wrong.
    b) there is no iterable feature to add, at least not to the base class.

  2. The codecs.writelines entry says "Writes the concatenated list of strings to the stream (possibly by reusing the write() method)." For the base class, that is overly restrictive, but I gather that Marc-Andre does not want to require that all subclasses be as relaxed. I presume there must be a reason. That means that the feature request of doing so is rejected by him. In that respect, this should be closed unless other developers strongly disagree.

  3. The iobase.writelines entry says "Write a list of lines to the stream." For StreamIO, iter(somelist) works just as well. If that applies to .writelines in general, the doc should say 'iterable of lines'. (Antoine, Benjamin?) If io.writelines does the same (concatenate all together), it should say so, but one comment suggests it does not.

  4. If an iterable produces millions of lines, it does not matter if it is a list, generator, or any other iterable. So the title does not point to the real problem, if there is one.
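
(A small aside on item 1, not part of the comment: str.join accepts any iterable of strings, not only a list.)

    >>> ''.join(['a', 'b', 'c'])
    'abc'
    >>> ''.join(c for c in 'abc')
    'abc'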

@malemburg
Member

On 25.03.2013 23:11, Terry J. Reedy wrote:

> 2. The codecs.writelines entry says "Writes the concatenated list of strings to the stream (possibly by reusing the write() method)." For the base class, that is overly restrictive, but I gather that Marc-Andre does not want to require that all subclasses be as relaxed. I presume there must be a reason. That means that the feature request of doing so is rejected by him. In that respect, this should be closed unless other developers strongly disagree.

The doc string of the base class merely describes what the
.writelines() method will be doing. It's not meant as a requirement.

In one of the previous messages, I've already outlined a workable
approach to solve the case that the original request was addressing.
There isn't much point in making the common case slower just
to address a theoretical corner case.

Since no patch has since been submitted, I'd suggest closing the
ticket as won't fix or out of date.

@benjaminp
Contributor

Agreed. Modern consumers can use the io module.
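
(For what it's worth, an illustration rather than part of the comment: io.IOBase.writelines iterates over its argument and writes each item, so a generator works without the whole output being built up front, subject only to the stream's normal buffering.)

    import io

    def lines():
        for i in range(5):
            yield u'line %d\n' % i

    with io.open('out_io.txt', 'w', encoding='utf-8') as f:
        f.writelines(lines())   # each line is written as it is produced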

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022