Issue 23350: Content-length is incorrect when request body is a list or tuple

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67539

classification

Title:	Content-length is incorrect when request body is a list or tuple
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Pinz, berker.peksag, demian.brecht, martin.panter, orsenthil, r.david.murray, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2015-01-30 01:12 by demian.brecht, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description	Edit
list_content_length.patch	demian.brecht, 2015-01-30 01:12		review
list_content_length_1.patch	demian.brecht, 2015-01-30 05:04		review
list_content_length_2.patch	demian.brecht, 2015-02-13 05:50		review
list_content_length_3.patch	demian.brecht, 2015-02-13 15:46		review

Messages (22)
msg235012 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-01-30 01:12
Rather than summing the value of each element of a list or tuple to use as the value of the content-length header, the length of the list or tuple is used.
msg235019 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-01-30 05:04
Updated patch based on review.
msg235022 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-01-30 05:51
[Edit Error: 'utf8' codec can't decode byte 0xe2 in position 207: invalid continuation byte] The documentation currently says “Content-Length header should be explicitly provided when the body is an iterable”. See Lib/urllib/request.py:1133 for how it is done for urlopen(), using memoryview(), which is probabaly more correct.
msg235068 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-01-31 00:45
On 2015-01-29 9:51 PM, Martin Panter wrote: > The documentation currently says “Content-Length header should be explicitly provided when the body is an iterable”. See Lib/urllib/request.py:1133 for how it is done for urlopen(), using memoryview(), which is probabaly more correct. Sure, entirely disabling computing the content length if the body is an iterable is one way to address it, but I'm not convinced that it's better. If the ability is there to compute the content length, why not do so? The current implementation /should/ be correct whether elements are bytes or strings (the default encoding doesn't allow for multi-byte chars, so len(raw_string) should equal len(encoded_string)) when encoded using the new block of encoding code I added in the patch. Is there something that I'm missing? I could possibly see an argument for performance as you're iterating over each element in the list, but that would be entirely circumvented if the user defines the content length up front.
msg235075 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-01-31 01:29
Sorry my comment was a bit rushed. I wasn’t saying this feature shouldn’t be added. I guess I was pointing out two things: 1. Someone should updated the documentation to say that Content-Length no longer has to be explicitly provided for lists and tuples. 2. Perhaps you could consider using the same len(memoryview) * memoryview.itemsize technique used in urllib, so that the length of e.g. array.array("I", range(3)) is correct. But that is tangential to what you are trying to achieve, and now I realize coping with Latin-1 encoding at the same time might make it a bit too complicated, so perhaps don’t worry about it :)
msg235876 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-02-13 05:50
Thanks for the clarification Martin. After giving this some further thought, I think that the best way to go is to /only/ calculate and add the Content-Length header if each element in the list or tuple is pre-encoded. If it's mixed or only strings, then there are one of three options: 1. Don't try to compute the Content-Length automatically as encoding may change the number of bytes being sent across as the body (which is what the Content-Length represents) 2. Encode the entire body twice, once during the computation of the Content-Length and again on the fly as the body is being written to the socket. This will incur 2x the CPU cost, which can be costly if the body is sufficiently large. 3. Encode the entire body once and store it in memory. Given body sizes can be quite large, I don't think that duplicating the body would be a good approach. The attached patch uses option 1 in order to not add any CPU or memory cost to the operation, but still fix the Content-Length issue as reported. I've also updated the docs to indicate as much.
msg235879 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-02-13 06:38
The length of an encoded Latin-1 string should equal the length of the unencoded text string, since it is a one-to-one character-to-byte encoding. So encoding should not actually be needed to determine the Latin-1 encoded length. Though I’m not particularly excited by silently Latin-1 encoding text bodies in the first place.
msg235893 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-02-13 15:46
Thanks for the review Martin, I've addressed your comments. > The length of an encoded Latin-1 string should equal the length of the unencoded text string, since it is a one-to-one character-to-byte encoding. Once in a while, I want to stop what I'm doing, put my head in my hands and think to myself "how did that escape me"?! Of course you're right and thanks for the catch. I've reverted the handling to how it was being done in the previous patch. > Though I’m not particularly excited by silently Latin-1 encoding text bodies in the first place. Truth be told, I'm more fond of only accepting pre-encoded byte strings as input. However, that backwards incompatible change would likely break many things. Request bodies can currently be strings, byte strings, iterables or file objects. In the cases of string and file objects, encoding is already supported. The change I made makes handling iterables consistent with the other accepted data types. I'm not sure why, but the auto-encoding of the raw string input object was being done higher up in the general use case callstack (Lib/http/client.py:1064). I've moved this handling to send() for consistency with the auto-encoding of other input types. This also ensures consistent behavior between calling request() with a string body and calling send() directly.
msg235949 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-02-14 04:38
New patch looks good I think. Making the encoding code more consistent is nice.
msg238825 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-03-21 19:02
There are two issues. The one is that calculated Content-Length is not correct for lists, tuples, and other types (such as deque or array.array). The right solution is to calculate size using a technique used in urllib. Content-Length shouldn't be calculated for lists, tuples, and other non-bytes-compatible sequences. This is a bug, and the patch should be applied to all maintained releases. The second issue is feature request. Allow calculating Content-Length for lists and tuples.
msg238848 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-03-21 23:05
Technically I don’t think there is a bug. The documentation says [the] “Content-Length header should be explicitly provided”, so if you don’t set it you could argue that you’re using the library wrong. For this issue I think Demian was trying to add support (i.e. new feature) for implicit Content-Length with tuples and lists of bytes (or strings). He has also added support for iterables of Latin-1 encodable text strings. What you are suggesting Serhiy sounds like a separate new feature to support bodies of arbitrary bytes-like objects (or lists or tuples of them). According to the documentation, only byte and Latin-1 text strings, file objects supporting stat(), and iterables are currently supported. It does not say, but before this patch I think the iterables had to be of bytes-like objects.
msg238938 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-03-22 20:30
I just updated the docs to what I think is the current reality. See issue 23740 for what I think are problems with the current implementation, aside from any enhancement of computing a length for tuple or list. Since the latter cannot be done reliably unless we know the list or tuple is all bytes, I propose that we don't do it at all (since I'd like to see iterables of text strings supported).
msg238939 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-03-22 20:34
Well, the current reality not counting the bug reported in this issue. So, I documented it as if the fix here is to not set the length when body is an iterator.
msg238954 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-03-22 23:35
David: Calculating the length of a list or tuple of Latin-1 text strings should actually be straight-forward and reliable, because it is a single-byte encoding. However I am starting to think adding a new special case for lists and tuples is a bad idea. There are already way too many special cases.
msg238963 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-03-23 00:40
@Serhiy: > Content-Length shouldn't be calculated for lists, tuples, and other non-bytes-compatible sequences. I'd agree with this if it wasn't relatively trivial to calculate. There's no reason that I can think of to exclude the auto-generated Content-Length header for data types for which the size is known. @Martin: > Technically I don’t think there is a bug. The bug is that the Content-Length header is currently added for bodies that are lists and tuples and the value is incorrect. For example: con.request('POST', '/', ['aaa', 'bbb', 'ccc']) results in Host: example.com Accept-Encoding: identity Content-Length: 3 @David: > Since the latter cannot be done reliably unless we know the list or tuple is all bytes, I propose that we don't do it at all (since I'd like to see iterables of text strings supported). The patch here adds support for iterables of text strings (as well as iterables comprised of both bytes and strings). Content-Length can be computed reliably as the size of a latin1-encoded string will be identical to the original string.
msg239763 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-03-31 23:42
The computation of Content-Length has also undergone some refactoring as part of #12319. Setting this as pending until #12319 has been accepted or rejected. If rejected, the implementation specific to generating Content-Length should be migrated here.
msg239764 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-03-31 23:44
If #12319 is accepted, the implementation for Content-Length should also likely be migrated to this issue to be applied to maintenance branches as a bug fix.
msg240563 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-04-12 16:04
See also issue 12327 for length issue using StingIO.
msg240611 - (view)	Author: Vincent Alquier (Pinz)	Date: 2015-04-13 14:28
Another issue should be addressed by patch... When trying to guess the content-length, here is the code you find... try: thelen = str(len(body)) except TypeError as te: [...] The call to `len` will raise a `TypeError` in case of a C file "object". But if body is a python file-like object, it will often raise an `AttributeError`. So, the code should be replaced by (in both python 2.7 and 3): try: thelen = str(len(body)) except (TypeError, AttributeError): [...]
msg240760 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-04-13 21:24
Vincent: That sounds more like a case of Issue 15267, or have you found a way to trigger the AttributeError in Python 3 as well?
msg240840 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-04-14 04:52
Vincent: The logic to determine content length is undergoing a bit of an overhaul as part of #12319, which I'm hoping to wrap up in the next week or so.
msg240917 - (view)	Author: Vincent Alquier (Pinz)	Date: 2015-04-14 15:48
Martin: You're right, it's the same issue, and only related to python2's old style classes. Sorry for the useless noise. Demian: My problem is `len(obj)` raises an Using AttributeError in python2 (with obj being old style class instance). It's python 2.X specific and I don't think it is going to be addressed (reading comments in #15267). But thanks for your code rework.

History
Date	User	Action	Args
2022-04-11 14:58:12	admin	set	github: 67539
2015-04-14 15:48:37	Pinz	set	messages: + msg240917
2015-04-14 04:52:00	demian.brecht	set	messages: + msg240840
2015-04-13 21:24:03	martin.panter	set	messages: + msg240760
2015-04-13 14:28:52	Pinz	set	nosy: + Pinz messages: + msg240611
2015-04-12 16:04:43	r.david.murray	set	messages: + msg240563
2015-04-12 16:03:07	r.david.murray	link	issue12327 superseder
2015-03-31 23:44:40	demian.brecht	set	status: pending -> open messages: + msg239764
2015-03-31 23:42:59	demian.brecht	set	status: open -> pending messages: + msg239763
2015-03-23 00:40:17	demian.brecht	set	messages: + msg238963
2015-03-22 23:35:10	martin.panter	set	messages: + msg238954
2015-03-22 20:34:48	r.david.murray	set	messages: + msg238939
2015-03-22 20:30:47	r.david.murray	set	nosy: + r.david.murray messages: + msg238938
2015-03-21 23:05:46	martin.panter	set	messages: + msg238848
2015-03-21 19:02:35	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg238825
2015-02-20 12:22:31	berker.peksag	set	nosy: + berker.peksag stage: patch review
2015-02-14 04:38:22	martin.panter	set	messages: + msg235949
2015-02-13 15:46:57	demian.brecht	set	files: + list_content_length_3.patch messages: + msg235893
2015-02-13 06:38:30	martin.panter	set	messages: + msg235879
2015-02-13 05:50:55	demian.brecht	set	files: + list_content_length_2.patch messages: + msg235876
2015-01-31 01:29:17	martin.panter	set	messages: + msg235075
2015-01-31 00:45:11	demian.brecht	set	messages: + msg235068
2015-01-30 05:51:20	martin.panter	set	messages: + msg235022
2015-01-30 05:11:07	martin.panter	set	nosy: + martin.panter
2015-01-30 05:04:36	demian.brecht	set	files: + list_content_length_1.patch messages: + msg235019
2015-01-30 01:23:52	pitrou	set	nosy: + orsenthil
2015-01-30 01:13:13	demian.brecht	set	type: behavior components: + Library (Lib) versions: + Python 3.5
2015-01-30 01:12:42	demian.brecht	create