This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Incorrect handling of HTTP response with "Content-Type: message/rfc822" header
Type: Stage: patch review
Components: email Versions: Python 3.7, Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, brokenenglish, martin.panter, orsenthil, r.david.murray
Priority: normal Keywords: patch

Created on 2017-01-23 18:58 by brokenenglish, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description Edit
rfc822_httpmessage_payload.patch brokenenglish, 2017-01-23 18:58 review
Pull Requests
URL Status Linked Edit
PR 12214 cschmidbauer, 2019-09-14 18:57
Messages (4)
msg286106 - (view) Author: (brokenenglish) Date: 2017-01-23 18:58
Hello.

I found a bug that causes incorrect handling of some values of "Content-Type" header. When you retrieve any URL with "Content-Type: message/rfc822" header, additional payload is added to HTTPMessage. In some cases it causes annoing warnings. Here is a code from packages requests and urllib3 that can cause warning output:
1. Checking and raising excpection in urllib3: https://github.com/shazow/urllib3/blob/0fb5e083b2adf7618db8c26e8e50206de09dd845/urllib3/util/response.py#L61-L66
2. The same code packaged in requests: https://github.com/kennethreitz/requests/blob/362da46e9a46da6e86e1907f03014384ab210151/requests/packages/urllib3/util/response.py#L61-L66
3. Logging the error to output: https://github.com/kennethreitz/requests/blob/362da46e9a46da6e86e1907f03014384ab210151/requests/packages/urllib3/connectionpool.py#L402-L407

The issue arises from the class email.feedparser.FeedParser that handles HTTP and email headers in the same way. So when it handles headers with the message with content-type "message/*" and some other, method _parsegen() tries to parse it like a message with another message (see comments to this condition: https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l295). If headers-only argument is set to True, we can avoid some redundant checks on HTTP headers (see this condition: https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l244).

I read the comment above (https://hg.python.org/cpython/file/tip/Lib/email/feedparser.py#l241) but I am a newbie and I don't see any other solution to my problem for now. I can rewrite the patch if you show me a better way to fix this issue.

A patch with the unit test is atached.
msg286110 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-01-23 19:44
IMO http *should* be using headersonly=True, so while I haven't looked into this issue the solution seems plausible to me.  I'm not entirely sure why it is considered a backward-compatibility hack, but I don't see any likelyhood that the headersonly API will go way.
msg286126 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-01-23 23:56
There is an inconsistency when parsing with headersonly=True. According to the documentation, get_payload() with message/rfc822 should should return a list of Message objects, not a string. But using headersonly=True produces a non-multipart Message object:

>>> m = Parser().parsestr("Content-Type: message/rfc822\r\n\r\n", headersonly=True)
>>> m.get_content_type()
'message/rfc822'
>>> m.is_multipart()  # Doc says True
False
>>> m.get_payload()  # Doc says list of Message objects
''

Related to this, setting headersonly=True can also cause a internal inconsistency. Maybe this is why it was called a “hack”:

>>> Parser().parsestr("Content-Type: message/delivery-status\r\nInvalid line\r\n\r\n", headersonly=True).as_string()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/email/message.py", line 159, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python3.5/email/generator.py", line 115, in flatten
    self._write(msg)
  File "/usr/lib/python3.5/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.5/email/generator.py", line 214, in _dispatch
    meth(msg)
  File "/usr/lib/python3.5/email/generator.py", line 331, in _handle_message_delivery_status
    g.flatten(part, unixfrom=False, linesep=self._NL)
  File "/usr/lib/python3.5/email/generator.py", line 106, in flatten
    old_msg_policy = msg.policy
AttributeError: 'str' object has no attribute 'policy'

I think it may be best only change get_payload() to return a string in the next Python version (3.7), with appropriate documentation updates. For existing Python versions, perhaps urllib3 could check if the list returned by get_payload() only has trivial empty Message objects (no header fields and only empty payloads themselves).

If we agree that only a feature change for 3.7 is appropriate, there are other problems with the current parsing of HTTP headers that could also be looked at:

* Only a blank line should end a header section (Issue 24363, Issue 26686)
* “From” line should be a defect
* Use “email” package’s HTTP parsing policy
* Don’t assume Latin-1 encoding (Issue 27716)
* Avoid double-handling (header lines are parsed in http.client, then joined together and parsed again in email.feedparser)
msg291176 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-04-05 11:36
I'm not surprised that trying to render a message parsed with 'headersonly' fails.  headersonly treats the entire message body as a single string payload.  I'm not sure what the correct behavior should be for the email package, but the fact that this doesn't work currently shouldn't matter to the http package.  Given the error involves policy, it is quite possible as_string used to work in this context and I broke it when introducing policy because there is no test that covers this case.  I'm guessing there is no test because turning a message parsed with headersonly back into a string wasn't considered to be useful.

As for the multipart thing, that is actually consistent with the docs.  get_content_type gives you the value of the header, is_multipart effectively reports whether or not the payload is a list, and when parsed with headersonly, the body is a string not a list, which is documented.  Granted, you have to connect those dots yourself, so a sentence about that *could* be added to the headersonly docs.

All of which should be in a separate issue from this one.
History
Date User Action Args
2022-04-11 14:58:42adminsetgithub: 73539
2019-09-14 18:57:39cschmidbauersetstage: patch review
pull_requests: + pull_request15760
2017-04-07 02:14:58orsenthilsetnosy: + orsenthil
2017-04-05 11:36:15r.david.murraysetmessages: + msg291176
2017-04-05 08:24:53martin.panterlinkissue29991 superseder
2017-01-23 23:56:05martin.pantersetversions: - Python 3.3, Python 3.4
nosy: + barry, martin.panter

messages: + msg286126

components: + email
2017-01-23 19:44:41r.david.murraysetnosy: + r.david.murray
messages: + msg286110
2017-01-23 18:58:25brokenenglishcreate