classification
Title: make io.BufferedWriter observe max_buffer_size limits
Type: resource usage Stage:
Components: Library (Lib) Versions: Python 3.0, Python 2.7, Python 2.6
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: amaury.forgeotdarc, beazley, gregory.p.smith, gvanrossum, loewis, pitrou
Priority: normal Keywords: needs review, patch

Created on 2008-11-25 12:34 by beazley, last changed 2009-10-09 14:04 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
issue4428-io-bufwrite-gps04.diff gregory.p.smith, 2009-01-20 05:03 obey max_buffer_size and blocking semantics, with unit tests
issue4428-io-bufwrite-gps05.diff gregory.p.smith, 2009-02-01 06:26
Messages (15)
msg76408 - (view) Author: David M. Beazley (beazley) Date: 2008-11-25 12:34
The Buffered I/O interface in the io module has the user specify buffer 
limits such as size and max_buffer_size.   The first limit (size) is 
easy to understand as a buffering threshold at which writes will occur.  
However, no apparent attempt is made to strictly limit the internal 
buffer size to max_buffer_size.
   
In BuffererWriter.write(), one of the first operations is 

     self._write_buf.extend(b)

which simply extends the buffer by the full data being written.  If b 
happens to be a large string (e.g., megabytes or even the entire 
contents of a big file), then the internal I/O buffer makes a complete 
copy of the data, effectively doubling the memory requirements for 
carrying out the write operation.

I suppose most programmers might not notice given that everyone has 
gigabytes of RAM these days, but you certainly don't see this kind of 
buffering behavior in the operating system kernel or in the C library.

Some patch suggestions (details left to the maintainers of this module):

1. Don't extend self._write_buf by more than the max_buffer_size.

     fragment = b[:self.max_buffer_size - len(self._write_buf)]
     self._write_buf.extend(fragment)

2. For large blocking writes, simply carry out the remaining I/O 
   operations in the write() method instead of in the _flush_locked()
   method.   Try to use the original input data b as the data
   source instead of making copies of it.  And if you have to copy
   the data, don't do it all at once.
msg76571 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-11-29 00:54
I've attached my first attempt at fixing this as io-bufwrite-gps01.

Unfortunately it causes the Lib/test/test_io.py testThreads to fail:

======================================================================
FAIL: testThreads (__main__.BufferedReaderTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_io.py", line 456, in testThreads
    self.assertEqual(s.count(c), N)
AssertionError: 30 != 1000

I haven't read the test yet, but my guess is that the test is broken and
is blindly calling write() without checking the return value.
msg76573 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-11-29 01:01
Yep, the test was ignoring the return value from BufferedWriter.write. 
Fixed in the attached io-bufwrite-gps02.

This can wait for 3.0.1 and 2.6.1/2 (depending on the 2.6.x release
schedule).
msg76574 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-11-29 01:15
I do not agree. User code should not have to verify the return value of write().

When a big amount of data is passed to BufferedWriter.write,
- normal blocking streams should block until all data has been written.
(the implementation may call self.raw.write several times)
- non-blocking streams should raise BlockingIOError indicating the number of 
characters actually written.

This applies only to buffered files, of course.
msg76576 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-11-29 01:42
I agree with Amuary, write() traditionally never writes less data unless
the underlying IO is in nonblocking mode.

I'm working up a new patch to write to self.raw in max_buffer_size
chunks with as few data copies as possible, raising an appropriate
BlockingIOError (as the surrounding code does) when the IO on the
underlying self.raw.write raises one through self._flush_unlocked().
msg76581 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-11-29 02:38
Okay, here's a new patch that obeys the blocking vs nonblocking
semantics properly.  It still needs explicit unit tests for proper behavior.
msg76590 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-11-29 12:49
According to msg76581, the patch fails review (lacking test cases), so
I'm removing the "needs review" keyword.
msg76591 - (view) Author: David M. Beazley (beazley) Date: 2008-11-29 13:24
I agree with previous comments that write() should definitely write all 
data when in blocking mode.
msg80234 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-01-20 05:07
Updated patch with unit tests to verify the write behavior attached in
-gps04.  Up for code review at http://codereview.appspot.com/12470/show
msg80292 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-20 23:27
Hi!

that sounds
> like a behavior change.  I'd be fine with removing it for the 3.1/2.7
version of
> this code (though I hope people will be using the C implementation
instead).

Well, either it's supported and it will have to go through a deprecation
phase, or it's unsupported and it can be ripped out right now...

I don't think it should be supported at all, given that the semantics of
writing an iterable of ints are totally non-obvious. Reading both the
PEP and the docstrings in io.py, I only see mentions of "bytes" and
"buffer", not of an iterable of ints. Perhaps Guido should pronounce.

(do you know of any code relying on this behaviour? the C version
obviously does not support it and all regression tests pass fine, except
for an SSL bug I filed)

http://codereview.appspot.com/12470
msg80299 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-01-21 00:18
I know of no code relying on this behavior, I merely duplicated the 
behavior that existed while writing the patch.  I'm happy to simplify.  
Adding Guido to the CC list for a pronouncement.

Guido is it okay to change io.BufferedWriter.write() in release30-maint 
to no longer accept iterables that do not support len() and things that 
do not support the buffer API?

Antoine already notes in the codereview that other existing io classes 
such as FileIO currently do not support non bytes/buffer or buffer-api 
supporting inputs:  http://codereview.appspot.com/12470/patch/1/3
msg80304 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2009-01-21 01:01
@Gregory, that sounds like an odd enough use case to skip.  However you
might want to look for __length_hint__ before giving up?

OTOH unless the use case is real, why not support it but making it slow?
msg80902 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-02-01 06:29
I've uploaded a new patch set with more extensive unit tests.  It also
handles the case of writing array.array objects (or anything with a
memoryview itemsize > 1).  The previous code would buffer by item rather
than by byte.  it has been updated in codereview as well.
msg80903 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-02-01 06:30
fwiw, I decided Guido and Antoine were right and took out the support
for input that did not support len() to keep things a bit simpler.
msg93792 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-09 14:04
max_buffer_size is no longer used, so this issue is obsolete ;)
History
Date User Action Args
2009-10-09 14:04:56pitrousetstatus: open -> closed
resolution: out of date
messages: + msg93792
2009-02-01 06:30:55gregory.p.smithsetmessages: + msg80903
2009-02-01 06:29:49gregory.p.smithsetmessages: + msg80902
2009-02-01 06:26:49gregory.p.smithsetfiles: + issue4428-io-bufwrite-gps05.diff
2009-01-21 01:01:21gvanrossumsetmessages: + msg80304
2009-01-21 00:18:11gregory.p.smithsetnosy: + gvanrossum
messages: + msg80299
2009-01-20 23:27:50pitrousetnosy: + pitrou
messages: + msg80292
title: io.BufferedWriter does not observe buffer size limits -> make io.BufferedWriter observe max_buffer_size limits
2009-01-20 05:07:20gregory.p.smithsetfiles: - issue4428-io-bufwrite-gps03.diff
2009-01-20 05:07:06gregory.p.smithsetkeywords: + needs review
messages: + msg80234
2009-01-20 05:03:51gregory.p.smithsetfiles: + issue4428-io-bufwrite-gps04.diff
2009-01-20 04:16:55gregory.p.smithsetfiles: - issue4428-io-bufwrite-gps02.diff
2008-11-29 13:24:50beazleysetmessages: + msg76591
2008-11-29 12:49:10loewissetkeywords: - needs review
nosy: + loewis
messages: + msg76590
2008-11-29 02:38:52gregory.p.smithsetfiles: - issue4428-io-bufwrite-gps01.diff
2008-11-29 02:38:27gregory.p.smithsetfiles: + issue4428-io-bufwrite-gps03.diff
messages: + msg76581
2008-11-29 01:42:47gregory.p.smithsetmessages: + msg76576
2008-11-29 01:15:44amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg76574
2008-11-29 01:01:37gregory.p.smithsetkeywords: + needs review
files: + issue4428-io-bufwrite-gps02.diff
messages: + msg76573
2008-11-29 00:54:29gregory.p.smithsetfiles: + issue4428-io-bufwrite-gps01.diff
keywords: + patch
messages: + msg76571
2008-11-29 00:13:00gregory.p.smithsetpriority: normal
assignee: gregory.p.smith
nosy: + gregory.p.smith
2008-11-25 12:34:59beazleycreate