Issue 36050: Why does http.client.HTTPResponse._safe_read use MAXAMOUNT

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/80231

classification

Title:	Why does http.client.HTTPResponse._safe_read use MAXAMOUNT
Type:	performance	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.8

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	bmerry, lukasz.langa, martin.panter, methane
Priority:	normal	Keywords:	patch

Created on 2019-02-20 12:55 by bmerry, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 12698	merged	methane, 2019-04-05 12:32

Messages (11)
msg336081 - (view)	Author: Bruce Merry (bmerry) *	Date: 2019-02-20 12:55
While investigating poor HTTP read performance I discovered that reading all the data from a response with a content-length goes via _safe_read, which in turn reads in chunks of at most MAXAMOUNT (1MB) before stitching them together with b"".join. This can really hurt performance for responses larger than MAXAMOUNT, because (a) the data has to be copied an additional time; and (b) the join operation doesn't drop the GIL, so this limits multi-threaded scaling. I'm struggling to see any advantage in doing this chunking - it's not saving memory either (in fact it is wasting it). To give an idea of the performance impact, changing MAXAMOUNT to a very large value made a multithreaded test of mine go from 800MB/s to 2.5GB/s (which is limited by the network speed).
msg336498 - (view)	Author: Martin Panter (martin.panter) *	Date: 2019-02-25 05:15
The 1 MiB limit was added for Issue 1296004; apparently some platforms were overallocating multiple buffers and running out of memory. I suspect the loop in "_safe_read" was inherited from Python 2, which has different kinds of file objects. The comments suggest it does partial reads. But the Python 3 code calls "socket.makefile" with "buffering" mode enabled by default. I agree it should be safe to read the total length in one go.
msg339496 - (view)	Author: Inada Naoki (methane) *	Date: 2019-04-05 11:57
issue1296004 is too old (512MB RAM machine!) and I can not confirm it. But I think it was caused by inefficient realloc() in CRT. See https://github.com/python/cpython/blob/6c52d76db8b1cb836c136bd6a1044e85bfe8241e/Lib/socket.py#L298-L303 _fileobject called socket.recv with remaining size. Typically, socket can't return MBs at once. So it cause: 1. Large (at most `amt`, some MBs) string (bytes) are allocated. (malloc) 2. recv is called. 3. _PyString_Resize() (realloc) is called with smaller bytes (typically ~128KiB) 4. amt -= received 5. if amt == 0: exit; goto 1. This might stress malloc and realloc in CRT. It caused fragmentation and MemoryError. --- For now, almost everything is rewritten. In case of _pyio, BufferedIOReader calls RawIOBase.read(). It copies from bytearray to bytes. So call only malloc and free. Stress for realloc will be reduced. https://github.com/python/cpython/blob/50866e9ed3e4e0ebb60c20c3483a8df424c02722/Lib/_pyio.py#L586-L591 In case of C _io module, it is more efficient. It allocate PyBytes once and calls SocketIO.read_into directly. No temporary bytes objects are created. https://github.com/python/cpython/blob/50866e9ed3e4e0ebb60c20c3483a8df424c02722/Modules/_io/bufferedio.c#L1632 https://github.com/python/cpython/blob/50866e9ed3e4e0ebb60c20c3483a8df424c02722/Modules/_io/bufferedio.c#L1658 https://github.com/python/cpython/blob/50866e9ed3e4e0ebb60c20c3483a8df424c02722/Modules/_io/bufferedio.c#L1470
msg339498 - (view)	Author: Inada Naoki (methane) *	Date: 2019-04-05 12:12
Additionally, _safe_read calls fp.read() multiple times to handle EINTR. But EINTR is handled by socket module now (PEP 475). Now the function can be very simple.
msg339532 - (view)	Author: Inada Naoki (methane) *	Date: 2019-04-06 09:06
New changeset d6bf6f2d0c83f0c64ce86e7b9340278627798090 by Inada Naoki in branch 'master': bpo-36050: optimize HTTPResponse.read() (GH-12698) https://github.com/python/cpython/commit/d6bf6f2d0c83f0c64ce86e7b9340278627798090
msg398376 - (view)	Author: Łukasz Langa (lukasz.langa) *	Date: 2021-07-28 13:27
New changeset 153365d864c411f6fb523efa752ccb3497d815ca by Inada Naoki in branch '3.9': [3.9] bpo-42853: Fix http.client fails to download >2GiB data over TLS (GH-27405) https://github.com/python/cpython/commit/153365d864c411f6fb523efa752ccb3497d815ca
msg398387 - (view)	Author: Bruce Merry (bmerry) *	Date: 2021-07-28 14:11
Re-opening because the patch to fix this has just been reverted due to bpo-42853.
msg398389 - (view)	Author: Inada Naoki (methane) *	Date: 2021-07-28 14:15
Note that it was reverted only in 3.9 branch.
msg398493 - (view)	Author: Łukasz Langa (lukasz.langa) *	Date: 2021-07-29 15:57
Inadasan is right, this is still fixed in 3.10 and 3.11. The 3.9 revert is due to 3.9 supporting OpenSSL older than 1.1.1. 3.10+ requires OpenSSL 1.1.1+ per PEP 644. There is nothing to do here.
msg398495 - (view)	Author: Bruce Merry (bmerry) *	Date: 2021-07-29 16:12
> There is nothing to do here. Will you accept patches to fix this for 3.9? I'm not clear whether the "bug fixes only" status of 3.9 allows for fixing performance regressions.
msg398496 - (view)	Author: Bruce Merry (bmerry) *	Date: 2021-07-29 16:22
> Will you accept patches to fix this for 3.9? I'm not clear whether the "bug fixes only" status of 3.9 allows for fixing performance regressions. Never mind, I see your already answered this on bpo-42853 (as a no). Thanks for taking the time to answer my questions; I'll just have to skip Python 3.9 for this particular application and go straight to 3.10.

History
Date	User	Action	Args
2022-04-11 14:59:11	admin	set	github: 80231
2021-07-29 16:22:53	bmerry	set	messages: + msg398496
2021-07-29 16:12:33	bmerry	set	messages: + msg398495
2021-07-29 15:57:11	lukasz.langa	set	status: open -> closed resolution: fixed messages: + msg398493 stage: needs patch -> resolved
2021-07-28 14:40:51	gregory.p.smith	set	resolution: fixed -> (no value) stage: resolved -> needs patch
2021-07-28 14:15:55	methane	set	messages: + msg398389
2021-07-28 14:11:08	bmerry	set	status: closed -> open messages: + msg398387
2021-07-28 13:27:57	lukasz.langa	set	nosy: + lukasz.langa messages: + msg398376
2019-04-06 09:06:31	methane	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2019-04-06 09:06:21	methane	set	messages: + msg339532
2019-04-05 12:32:44	methane	set	keywords: + patch stage: patch review pull_requests: + pull_request12623
2019-04-05 12:12:46	methane	set	messages: + msg339498
2019-04-05 11:57:37	methane	set	type: performance messages: + msg339496 versions: + Python 3.8, - Python 3.7
2019-04-05 10:51:14	methane	set	nosy: + methane
2019-02-25 05:15:49	martin.panter	set	nosy: + martin.panter messages: + msg336498
2019-02-20 12:55:43	bmerry	create