Issue 29353: Incorrect handling of HTTP response with "Content-Type: message/rfc822" header

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/73539

classification

Title:	Incorrect handling of HTTP response with "Content-Type: message/rfc822" header
Type:		Stage:	patch review
Components:	email	Versions:	Python 3.7, Python 3.6, Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	barry, brokenenglish, martin.panter, orsenthil, r.david.murray
Priority:	normal	Keywords:	patch

Created on 2017-01-23 18:58 by brokenenglish, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description	Edit
rfc822_httpmessage_payload.patch	brokenenglish, 2017-01-23 18:58		review

Pull Requests
URL	Status	Linked	Edit
PR 12214		cschmidbauer, 2019-09-14 18:57

Messages (4)
msg286106 - (view)	Author: (brokenenglish)	Date: 2017-01-23 18:58
Hello. I found a bug that causes incorrect handling of some values of "Content-Type" header. When you retrieve any URL with "Content-Type: message/rfc822" header, additional payload is added to HTTPMessage. In some cases it causes annoing warnings. Here is a code from packages requests and urllib3 that can cause warning output: 1. Checking and raising excpection in urllib3: https://github.com/shazow/urllib3/blob/0fb5e083b2adf7618db8c26e8e50206de09dd845/urllib3/util/response.py#L61-L66 2. The same code packaged in requests: https://github.com/kennethreitz/requests/blob/362da46e9a46da6e86e1907f03014384ab210151/requests/packages/urllib3/util/response.py#L61-L66 3. Logging the error to output: https://github.com/kennethreitz/requests/blob/362da46e9a46da6e86e1907f03014384ab210151/requests/packages/urllib3/connectionpool.py#L402-L407 The issue arises from the class email.feedparser.FeedParser that handles HTTP and email headers in the same way. So when it handles headers with the message with content-type "message/*" and some other, method _parsegen() tries to parse it like a message with another message (see comments to this condition: https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l295). If headers-only argument is set to True, we can avoid some redundant checks on HTTP headers (see this condition: https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l244). I read the comment above (https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l241) but I am a newbie and I don't see any other solution to my problem for now. I can rewrite the patch if you show me a better way to fix this issue. A patch with the unit test is atached.
msg286110 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2017-01-23 19:44
IMO http should be using headersonly=True, so while I haven't looked into this issue the solution seems plausible to me. I'm not entirely sure why it is considered a backward-compatibility hack, but I don't see any likelyhood that the headersonly API will go way.
msg286126 - (view)	Author: Martin Panter (martin.panter) *	Date: 2017-01-23 23:56
There is an inconsistency when parsing with headersonly=True. According to the documentation, get_payload() with message/rfc822 should should return a list of Message objects, not a string. But using headersonly=True produces a non-multipart Message object: >>> m = Parser().parsestr("Content-Type: message/rfc822\r\n\r\n", headersonly=True) >>> m.get_content_type() 'message/rfc822' >>> m.is_multipart() # Doc says True False >>> m.get_payload() # Doc says list of Message objects '' Related to this, setting headersonly=True can also cause a internal inconsistency. Maybe this is why it was called a “hack”: >>> Parser().parsestr("Content-Type: message/delivery-status\r\nInvalid line\r\n\r\n", headersonly=True).as_string() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/python3.5/email/message.py", line 159, in as_string g.flatten(self, unixfrom=unixfrom) File "/usr/lib/python3.5/email/generator.py", line 115, in flatten self._write(msg) File "/usr/lib/python3.5/email/generator.py", line 181, in _write self._dispatch(msg) File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch meth(msg) File "/usr/lib/python3.5/email/generator.py", line 331, in _handle_message_delivery_status g.flatten(part, unixfrom=False, linesep=self._NL) File "/usr/lib/python3.5/email/generator.py", line 106, in flatten old_msg_policy = msg.policy AttributeError: 'str' object has no attribute 'policy' I think it may be best only change get_payload() to return a string in the next Python version (3.7), with appropriate documentation updates. For existing Python versions, perhaps urllib3 could check if the list returned by get_payload() only has trivial empty Message objects (no header fields and only empty payloads themselves). If we agree that only a feature change for 3.7 is appropriate, there are other problems with the current parsing of HTTP headers that could also be looked at: * Only a blank line should end a header section (Issue 24363, Issue 26686) * “From” line should be a defect * Use “email” package’s HTTP parsing policy * Don’t assume Latin-1 encoding (Issue 27716) * Avoid double-handling (header lines are parsed in http.client, then joined together and parsed again in email.feedparser)
msg291176 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2017-04-05 11:36
I'm not surprised that trying to render a message parsed with 'headersonly' fails. headersonly treats the entire message body as a single string payload. I'm not sure what the correct behavior should be for the email package, but the fact that this doesn't work currently shouldn't matter to the http package. Given the error involves policy, it is quite possible as_string used to work in this context and I broke it when introducing policy because there is no test that covers this case. I'm guessing there is no test because turning a message parsed with headersonly back into a string wasn't considered to be useful. As for the multipart thing, that is actually consistent with the docs. get_content_type gives you the value of the header, is_multipart effectively reports whether or not the payload is a list, and when parsed with headersonly, the body is a string not a list, which is documented. Granted, you have to connect those dots yourself, so a sentence about that could be added to the headersonly docs. All of which should be in a separate issue from this one.

History
Date	User	Action	Args
2022-04-11 14:58:42	admin	set	github: 73539
2019-09-14 18:57:39	cschmidbauer	set	stage: patch review pull_requests: + pull_request15760
2017-04-07 02:14:58	orsenthil	set	nosy: + orsenthil
2017-04-05 11:36:15	r.david.murray	set	messages: + msg291176
2017-04-05 08:24:53	martin.panter	link	issue29991 superseder
2017-01-23 23:56:05	martin.panter	set	versions: - Python 3.3, Python 3.4 nosy: + barry, martin.panter messages: + msg286126 components: + email
2017-01-23 19:44:41	r.david.murray	set	nosy: + r.david.murray messages: + msg286110
2017-01-23 18:58:25	brokenenglish	create