classification
Title: should socket readline() use default_bufsize instead of _rbufsize?
Type: performance
Stage:
Components:
Versions: Python 3.1, Python 2.6
process
Status: closed
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ggenellina, gregory.p.smith, gvanrossum, kristjan.jonsson
Priority: normal
Keywords:

Created on 2008-11-27 20:57 by gregory.p.smith, last changed 2009-02-10 17:09 by kristjan.jonsson. This issue is now closed.

Messages (8)
msg76516 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-11-27 20:57
....
From Kristján Valur Jónsson (kristjan at ccpgames.com) on python-dev:

 http://mail.python.org/pipermail/python-dev/2008-November/083724.html
....

I came across this in socket.c:

        # _rbufsize is the suggested recv buffer size.  It is *strictly*
        # obeyed within readline() for recv calls.  If it is larger than
        # default_bufsize it will be used for recv calls within read().
       

What I worry about is the readline() case.  Is there a reason why we
want to strictly obey it for that function?  Note that in the
documentation for _fileobject.read() it says:

        # Use max, disallow tiny reads in a loop as they are very inefficient.

 
The same argument surely applies for readline().

 
The reason I am fretting about this is that httplib.py (and therefore
xmlrpclib.py) specify bufsize=0 when creating their socket fileobjects,
presumably to make sure that write() operations are not buffered but
flushed immediately.  But this has the side effect of setting
_rbufsize to 1, and so readline() calls become very slow.

 
I suggest that readline() be made to use at least default_bufsize, like
read().  Any thoughts?
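
To illustrate the cost being described, here is a minimal Python 3 sketch. It is not the actual socket.py code: CountingSocket and recv_calls_for_line are hypothetical names invented for this example, and the loop only counts how many recv() calls are needed before a newline is seen when each call is limited to bufsize bytes.

```python
class CountingSocket:
    """Hypothetical stand-in for a socket that counts recv() calls."""

    def __init__(self, data):
        self._data = data
        self.recv_calls = 0

    def recv(self, n):
        # Hand back at most n bytes, like a real recv() on a stream socket.
        self.recv_calls += 1
        chunk, self._data = self._data[:n], self._data[n:]
        return chunk


def recv_calls_for_line(data, bufsize):
    """Return how many recv(bufsize) calls it takes to see a newline."""
    sock = CountingSocket(data)
    seen = b""
    while b"\n" not in seen:
        chunk = sock.recv(bufsize)
        if not chunk:  # peer closed before a full line arrived
            break
        seen += chunk
    return sock.recv_calls


line = b"HTTP/1.1 200 OK\r\n"
print(recv_calls_for_line(line, 1))     # one call per byte of the line
print(recv_calls_for_line(line, 8192))  # a single call
```

With a one-byte receive size the number of calls grows with the length of the line, which is the slowdown reported for bufsize=0.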
msg76520 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-11-28 04:34
You meant socket.py.

This is an extremely subtle area.  I would be very wary of changing this
-- there is a use case where headers are read from the socket using
readline() but the rest of the data is read directly from the socket,
and this would break if there was buffered data in the file objects. 
This is exactly why httplib sets the buffer size to 0.

Fortunately things are completely different in Python 3.0 and I believe
the same problem doesn't exist -- in 3.0 it makes more sense to always
read from the (binary) buffered file object representing the socket.
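
A small sketch of the hazard described above, written against Python 3's socket.socketpair() and makefile() rather than the 2.x _fileobject. The helper name leftover_on_socket is hypothetical, and the result assumes the whole payload has already arrived when readline() runs, which is typical for a local socket pair.

```python
import socket


def leftover_on_socket(buffering):
    """Read one header line via a file object, then recv() on the raw socket.

    Returns (header, rest), where rest is whatever recv() can still see.
    """
    writer, reader = socket.socketpair()
    try:
        writer.sendall(b"Content-Length: 4\r\nBODY")
        fp = reader.makefile("rb", buffering)
        header = fp.readline()
        # Switch to non-blocking so an empty socket raises instead of hanging.
        reader.setblocking(False)
        try:
            rest = reader.recv(1024)
        except BlockingIOError:
            rest = b""  # nothing left: the file object's buffer holds the body
        fp.close()
        return header, rest
    finally:
        writer.close()
        reader.close()


print(leftover_on_socket(0))   # unbuffered: body is still on the socket
print(leftover_on_socket(-1))  # default buffering: body was read ahead
```

With buffering, the bytes after the header line sit in the file object's buffer, so code that switches to recv() on the socket never sees them.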
msg76522 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2008-11-28 09:35
If you look at http://bugs.python.org/issue4336, half of the proposed
patch is an attempt to deal with this performance issue.  In the patch,
we laboriously ensure that bufsize=-1 is passed in for the xmlrpc
client.

Seeing your comment, I realize that xmlrpclib.py also uses direct 
access to h._conn.sock (if present) and uses recv() on that.  In fact, 
that is the only place in the standard library where I can find this 
pattern.  Was that a performance improvement?  It is hard to see how 
bypassing buffered read with a manual recv() can significantly alter 
performance.

In all the cases in test_xmlrpc.py, h._conn.sock is actually None
because h._conn has been closed in HTTPConnection.getresponse().
Therefore, my patch continues to work.  However, I will fix that patch
to cater to this strange special case.

However, please observe that since _fileobject.read() calls are always
buffered, in general there is no way to safely mix read() and recv()
calls, although mixing recv() and readline() has been fudged to work.
Isn't this just a case of a wart in the standard lib that we ought to
remove?

Here is a suggestion:
1) document why readline() observes 0 buffering (to enable it to be
used as a readline() utility tool on top of vanilla socket recv())
2) stop doing that in xmlrpclib and use default buffering.
msg76538 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-11-28 15:50
I'm fine with disabling this feature in xmlrpclib.py, and possibly even
in httplib.py.

I'm *not* fine with "fixing" this behavior in socket.py -- the unittest
coverage is unfortunately small and we have had plenty of trouble in
this area in the past.  It is there for a reason, even if that reason is
hard to fathom and poorly documented.

Fortunately in 3.0 it's gone (or, more likely, replaced with a different
set of issues :-).
msg80160 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-01-19 11:58
Hi,
I'm reawakening this because http://bugs.python.org/issue4879 needs to 
be ported to py3k.
In py3k, a socket.fileobject() is still created with bufsize(0), 
although now the reasoning is different:

    def __init__(self, sock, debuglevel=0, strict=0, method=None):
        # XXX If the response includes a content-length header, we
        # need to make sure that the client doesn't read more than the
        # specified number of bytes.  If it does, it will block until
        # the server times out and closes the connection.  (The only
        # applies to HTTP/1.1 connections.)  Since some clients access
        # self.fp directly rather than calling read(), this is a little
        # tricky.
        self.fp = sock.makefile("rb", 0)

I think that this is just a translation of the old comment, i.e. a 
warning that some people may choose to call .recv() on the underlying 
socket.
This should be far more difficult now, with the newfangled IO
library and all, since what sock.makefile() returns is now a SocketIO
object which inherits from RawIOBase and all that.  It's tricky to extract
the socket to do .recv() on it.  So, I don't think we need to fear
buffering for readline() anymore.

Or, is the comment about someone doing a HTTPResponse.fp.read()
instead of a HTTPResponse.read()?  In that case, I don't see the
problem.  Of course, anyone reading N characters from a socket stream
may cause blocking.

My proposal is to remove the comment above and use default buffering 
for the fileobject.  Any thoughts?
msg80895 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-02-01 00:33
unassigning, i don't have time to look at this one right now.
msg80945 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-02-02 15:58
I have looked at this for py3k.
The behaviour of HTTPResponse.fp.read() is the same, whether fp is
buffered or not:  a read() will read to EOF for HTTP/1.1, which means
blocking indefinitely.  So, read() is forbidden for HTTP/1.1.  For
fp.read(n), buffered IO won't attempt to read more than is on the
stream, if n bytes are available (SocketIO.read(N) will return fewer
than N bytes and not block) so there is no reason not to use buffering.
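
The fp.read(n) point can be checked with a short Python 3 sketch, under the assumption that at least n bytes are already waiting on the socket. read_in_two_steps is a hypothetical helper written for this illustration; it is not part of http.client.

```python
import socket


def read_in_two_steps():
    """Send 10 bytes, then read them as 4 + 6 through a buffered file object."""
    writer, reader = socket.socketpair()
    try:
        writer.sendall(b"0123456789")
        fp = reader.makefile("rb")  # default buffering
        # Only 10 bytes are on the stream, far fewer than the internal
        # buffer size, yet read(4) returns once 4 bytes can be supplied.
        first = fp.read(4)
        # The remaining 6 bytes are requested exactly, so this does not
        # wait for data that will never come.
        second = fp.read(6)
        fp.close()
        return first, second
    finally:
        writer.close()
        reader.close()


print(read_in_two_steps())
```

Asking for more bytes than the peer will ever send would still block, buffered or not, which matches the remark that a bare read() is off limits on a persistent connection.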
msg81566 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-02-10 17:09
Issue 4879 has been resolved so that HTTPResponse invokes
socket.socket.makefile() with default buffering.  See r69209.  Since the
problem stated in this defect has no bearing on 3.0 (there is no special
hack for readline() in 3.0) I am closing this again.
History
Date                 User              Action  Args
2009-02-10 17:09:24  kristjan.jonsson  set     status: open -> closed; messages: + msg81566
2009-02-02 15:58:11  kristjan.jonsson  set     messages: + msg80945
2009-02-01 00:33:36  gregory.p.smith   set     assignee: gregory.p.smith ->; messages: + msg80895
2009-01-19 21:56:45  ggenellina        set     nosy: + ggenellina
2009-01-19 11:58:13  kristjan.jonsson  set     messages: + msg80160; versions: + Python 3.1
2008-11-28 15:50:54  gvanrossum        set     messages: + msg76538
2008-11-28 09:35:43  kristjan.jonsson  set     nosy: + kristjan.jonsson; messages: + msg76522
2008-11-28 04:34:16  gvanrossum        set     nosy: + gvanrossum; messages: + msg76520
2008-11-27 20:57:10  gregory.p.smith   create