Title: File protocol should document if writelines must handle generators sensibly
Type: Stage: needs patch
Components: Documentation, IO Versions: Python 3.4, Python 3.5, Python 2.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: JanKanis, benjamin.peterson, dhaffey, dlesco, docs@python, hynek, josh.r, lemburg, pitrou, stutzbach, terry.reedy
Priority: normal Keywords:

Created on 2014-07-03 09:38 by JanKanis, last changed 2016-03-12 00:51 by martin.panter.

Messages (4)
msg222165 - (view) Author: Jan Kanis (JanKanis) Date: 2014-07-03 09:38
The resolution of issue 5445 should be documented somewhere properly, so people can depend on it or not.

IOBase.writelines handles generator arguments without problems, i.e. without first draining the entire generator and then writing the result in one go. That would require large amounts of memory if the generator is large, and fail entirely if the generator is infinite. 

codecs.StreamWriter.writelines uses self.write(''.join(argument)) as implementation, which fails on very large or infinite arguments.
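The difference between the two implementations can be sketched as follows (the generator and buffer names are illustrative only). io.IOBase.writelines iterates the argument and writes item by item, so a generator is consumed lazily; codecs.StreamWriter.writelines joins the whole argument first, so the entire iterable is materialized before a single byte is written:

```python
import codecs
import io
import itertools

def lines():
    """An unbounded generator of lines (never materialize this fully)."""
    n = 0
    while True:
        yield "line %d\n" % n
        n += 1

# io.IOBase.writelines loops over the iterable calling self.write(),
# so a bounded slice of an infinite generator works fine:
buf = io.StringIO()
buf.writelines(itertools.islice(lines(), 3))
assert buf.getvalue() == "line 0\nline 1\nline 2\n"

# codecs.StreamWriter.writelines is implemented as
# self.write(''.join(list)), which exhausts the iterable up front.
# Passing lines() directly here would never return:
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.writelines(["a\n", "b\n"])  # fine for small lists only
assert raw.getvalue() == b"a\nb\n"
```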

According to issue 5445 it is not part of the file protocol that .writelines must handle (large/infinite) generators, only list-like iterables. However, as far as I know this is not documented anywhere, and people sometimes assume that writelines is meant for this case, e.g. jinja, whose dump method is explicitly documented to stream. The guarantees that .writelines does or does not make in this regard should be documented somewhere, so that either .writelines implementations that don't handle large generators can be pointed out as bugs, or code that assumes .writelines handles large generators can be.

I personally think .writelines should handle large generators, since in the python 3 world a lot of APIs were iterator-ified and it is what a lot of people would probably expect. But having a clear and documented decision on this is more important.

(note: I've copied most of the nosy list from #5445)
msg222252 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-07-04 00:48
+1. I've been assuming writelines handled arbitrary generators without an issue; guess I've gotten lucky and only used the ones that do. I've fed stuff populated by enormous (though not infinite) generators created from stuff like itertools.product and the like into it on the assumption that it would safely write it without generating len(seq) ** repeat values in memory.

I'd definitely appreciate a documented guarantee of this. I don't need it to explicitly guarantee that each item is written before the next item is pulled off the iterator or anything; if it wants to buffer a reasonable amount of data in memory before triggering a real I/O that's fine (generators returning mutable objects and mutating them when the next object comes along are evil anyway, and forcing one-by-one output can prevent some useful optimizations). But anything that uses argument unpacking, collection as a list, ''.join (or at the C level, PySequence_Fast and the like), forcing the whole generator to exhaust before writing byte one, is a bad idea.
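The buffering compromise described above can be sketched as a small helper (hypothetical, not part of the stdlib): it pulls a bounded chunk of items from the iterator per write() call, so memory use stays proportional to the chunk size rather than to the length of the iterable:

```python
import itertools

def stream_writelines(write, lines, chunk_size=1000):
    """Write lines from any iterable without materializing it fully.

    Joins up to chunk_size items per call to write() -- a middle
    ground between one syscall per line and exhausting the whole
    iterator before writing anything. Illustrative helper only.
    """
    it = iter(lines)
    while True:
        # islice pulls at most chunk_size items; an infinite
        # iterator therefore never gets fully consumed.
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        write("".join(chunk))
```

Used against an io.StringIO, `stream_writelines(buf.write, gen, chunk_size=2)` produces the same output as `buf.writelines(gen)` while keeping at most two lines buffered at a time.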
msg226120 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-08-30 04:17
Security fix only versions do not get doc fixes.
msg261399 - (view) Author: Dan Haffey (dhaffey) Date: 2016-03-09 03:00
+1, I just lost an hour-plus compute job to this. It sure violates POLA. I've been passing large generators to file.writelines since about as long as generators have existed, so I never would have guessed that a class named "StreamWriter" of all things wouldn't, you know, stream its writelines argument.
Date User Action Args
2016-03-12 00:51:09martin.pantersetstage: needs patch
2016-03-09 03:00:24dhaffeysetnosy: + dhaffey
messages: + msg261399
2014-08-30 04:17:47terry.reedysetmessages: + msg226120
versions: - Python 3.1, Python 3.2, Python 3.3
2014-07-04 00:48:40josh.rsetnosy: + josh.r
messages: + msg222252
2014-07-03 09:38:29JanKaniscreate