Issue 508157: urllib.urlopen results.readline is slow


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/35974

classification

Title:	urllib.urlopen results.readline is slow
Type:		Stage:
Components:	Library (Lib)	Versions:	Python 2.6

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:	gstein	Nosy List:	ajaksu2, akuchling, gstein, gvanrossum, kbdavidson, nobody, reacocard
Priority:	normal	Keywords:

Created on 2002-01-24 21:48 by kbdavidson, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Messages (9)
msg8975 - (view)	Author: Keith Davidson (kbdavidson)	Date: 2002-01-24 21:48
The socket file object underlying the return value of urllib.urlopen() is opened without any buffering, which makes results.readline() very slow. The specific problem is in the httplib.HTTPResponse constructor: it calls sock.makefile() with a buffer size of 0. Forcing the buffer size to 4096 cuts the time for calling readline() on a 60K-character line from 16 seconds to 0.27 seconds (there is other processing going on here, but the magnitude of the difference is correct).

I am using Python 2.0, so I cannot easily submit a patch, but the problem still appears to be present in the 2.2 source. The fix is to change the 0 in sock.makefile() to 4096 or some other reasonable buffer size:

    class HTTPResponse:
        def __init__(self, sock, debuglevel=0):
            self.fp = sock.makefile('rb', 0)  # <= change to 4096
            self.debuglevel = debuglevel
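The cost described in this report can be illustrated without a network connection. The following is a minimal sketch, assuming Python 3's io module as a stand-in for the Python 2 socket file object; CountingRaw and count_reads_for_line are hypothetical names and are not part of httplib. It counts how many low-level reads a single readline() call triggers with and without a 4096-byte buffer.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for an unbuffered socket file object.

    Serves bytes from memory and counts low-level read calls.
    """

    def __init__(self, data):
        super().__init__()
        self._src = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Each call here models one low-level read on a real socket.
        self.calls += 1
        return self._src.readinto(b)


def count_reads_for_line(data, buffer_size):
    """Return (line, number of low-level reads) for a single readline()."""
    raw = CountingRaw(data)
    # buffer_size == 0 mirrors sock.makefile('rb', 0): no buffering layer.
    stream = raw if buffer_size == 0 else io.BufferedReader(raw, buffer_size)
    return stream.readline(), raw.calls


if __name__ == "__main__":
    payload = b"x" * 60000 + b"\n"  # one 60K-character line, as in the report
    for size in (0, 4096):
        line, calls = count_reads_for_line(payload, size)
        print("buffer=%d line_len=%d low-level reads=%d" % (size, len(line), calls))
```

Without a buffer, the line is assembled from very small reads, so the number of low-level calls grows with the line length; with a buffer, each low-level read can return up to 4096 bytes.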
msg8976 - (view)	Author: Nobody/Anonymous (nobody)	Date: 2002-01-24 21:54
What platform? --Guido (not logged in)
msg8977 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2002-01-25 14:12
I wonder why the author explicitly turned off buffering. There was probably a reason. Without knowing why, we can't just change it.
msg8978 - (view)	Author: A.M. Kuchling (akuchling) *	Date: 2002-03-14 23:32
Greg Stein originally wrote it; I'll ping him. I suspect it might be because of HTTP pipelining: if multiple responses will be returned over a socket, you probably can't use buffering, because the buffer might consume the end of response #1 and the start of response #2.
msg8979 - (view)	Author: Greg Stein (gstein) *	Date: 2002-03-18 07:05
Andrew is correct. The buffering was turned off (specifically) so that the reading of one response will not consume a portion of the next response. Jeremy first found the over-reading problem a couple of years ago, and we solved the problem then. To read the thread: http://mail.python.org/pipermail/python-dev/2000-June/004409.html

After the HTTP response's headers have been read, it can be determined whether the connection will be closed at the end of the response or whether it will stay open for more requests to be performed. If it is going to be closed, then it is possible to use buffering. Of course, that is after the headers, so you'd actually need to do a second dup/makefile and turn on buffering. This also means that you wouldn't get the buffering benefits while reading headers.

It could be possible to redesign the connection/response classes to keep a buffer in the connection object, but that is quite a bit more involved. It also complicates the passing of the socket to the response object in some cases.

I'm going to close this as "invalid" since the proposed fix would break the code.
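The over-reading problem described in this message can be sketched as follows. This is an illustration under stated assumptions, not the httplib code: io.BytesIO stands in for a persistent socket that carries two responses back to back, and RESPONSE_1, RESPONSE_2, and read_status_line are hypothetical names. The sketch reads the first status line and then checks what is still left on the "connection" for a second reader.

```python
import io

RESPONSE_1 = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nfirst"
RESPONSE_2 = b"HTTP/1.1 200 OK\r\nContent-Length: 6\r\n\r\nsecond"


def read_status_line(buffer_size):
    """Read the first status line; return (line, bytes left on the connection)."""
    # BytesIO is a stand-in for a persistent socket carrying two responses.
    connection = io.BytesIO(RESPONSE_1 + RESPONSE_2)
    if buffer_size == 0:
        reader = connection  # read straight from the connection
    else:
        reader = io.BufferedReader(connection, buffer_size)
    status_line = reader.readline()
    # Whatever is still on the connection is all a second reader would see.
    left_on_connection = connection.read()
    return status_line, left_on_connection


if __name__ == "__main__":
    for size in (0, 4096):
        line, left = read_status_line(size)
        print("buffer=%d status=%r bytes left for next reader=%d"
              % (size, line, len(left)))
```

With a buffer, the first reader pulls both responses into its private buffer, so a second response object created on the same connection would find nothing to read. Reading directly from the connection leaves the second response in place.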
msg65019 - (view)	Author: Daniel Diniz (ajaksu2) *	Date: 2008-04-06 04:46
Well, this issue is still hurting performance; the most recent example involved the developer of a download manager. I suggest adding a buffer size argument to HTTPResponse.__init__ (defaulting to zero), along with docs that explain the problems that may arise from using a buffer. If there's any chance this might be accepted, I'll write a patch.
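One possible shape for such an argument is sketched below. This is a hypothetical illustration written for Python 3, not a patch against httplib: SketchResponse, its bufsize parameter, and demo are assumed names, and socket.socketpair() is used only to obtain a connected socket for the demonstration.

```python
import io
import socket


class SketchResponse:
    """Hypothetical illustration only; not the real httplib.HTTPResponse."""

    def __init__(self, sock, debuglevel=0, bufsize=0):
        # bufsize=0 keeps the existing unbuffered behaviour by default.
        # A positive value is only safe if no further response will be
        # read from the same socket.
        self.fp = sock.makefile('rb', bufsize)
        self.debuglevel = debuglevel

    def close(self):
        self.fp.close()


def demo():
    """Read one status line through unbuffered and buffered responses."""
    results = []
    for bufsize in (0, 4096):
        client, server = socket.socketpair()
        try:
            server.sendall(b"HTTP/1.1 200 OK\r\n")
            resp = SketchResponse(client, bufsize=bufsize)
            line = resp.fp.readline()
            buffered = isinstance(resp.fp, io.BufferedReader)
            results.append((bufsize, buffered, line))
            resp.close()
        finally:
            client.close()
            server.close()
    return results


if __name__ == "__main__":
    for bufsize, buffered, line in demo():
        print("bufsize=%d buffered=%s line=%r" % (bufsize, buffered, line))
```

Keeping zero as the default means existing callers see no change, and only callers that opt in take on the over-reading risk discussed earlier in this thread.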
msg65021 - (view)	Author: Aren Olson (reacocard)	Date: 2008-04-06 06:07
I can indeed confirm that this change creates a HUGE speed difference. Using the code found at [1] with python2.5 and apache2 under Ubuntu, changing the buffer size to 4096 improved the time needed to download 10MB from 15.5s to 1.78s, almost 9x faster. Repeat downloads of the same file (meaning the server now has the file cached in memory) yield times of 15.5s and 0.03s, a 500x improvement. When fetching from a server on the local network, rather than from localhost, these times become 15.5s and 0.9s in both cases, a 17x speedup. Real-world situations will likely be a mix of these; however, it is safe to say the speed improvement will be substantial.

Adding an option to adjust the buffer size would be very welcome, though the default value should still be zero, to avoid the issues already mentioned.

[1] - http://pastebin.ca/973578
msg65082 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-04-07 17:53
Please don't add to a closed issue that old. If you still have an issue with this, please open a new issue. If you have a patch, kindly upload it to the issue.
msg65122 - (view)	Author: Aren Olson (reacocard)	Date: 2008-04-07 20:56
new issue: http://bugs.python.org/issue2576

History
Date	User	Action	Args
2022-04-10 16:04:55	admin	set	github: 35974
2008-04-07 20:56:23	reacocard	set	messages: + msg65122
2008-04-07 17:53:18	gvanrossum	set	messages: + msg65082
2008-04-06 06:07:37	reacocard	set	nosy: + reacocard messages: + msg65021
2008-04-06 04:46:54	ajaksu2	set	nosy: + ajaksu2 messages: + msg65019 versions: + Python 2.6, - Python 2.2
2002-01-24 21:48:44	kbdavidson	create