classification
Title: httplib fails to handle semivalid HTTP headers
Type: behavior Stage: patch review
Components: email, Library (Lib) Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Lukasa, barry, demian.brecht, gboudreau, icordasc, martin.panter, mgdelmonte, orsenthil, piotr.dobrogost, python-dev, r.david.murray
Priority: normal Keywords: easy, patch

Created on 2015-06-02 15:06 by mgdelmonte, last changed 2017-02-07 15:12 by r.david.murray.

Files
File name Uploaded Description Edit
bypass-parsegen.patch martin.panter, 2016-08-12 12:37 Python 3 review
bypass-parsegen.v2.patch martin.panter, 2016-09-09 03:47 review
policy-flag.patch martin.panter, 2016-09-10 06:49 review
header.py2.patch martin.panter, 2016-09-10 09:38 review
policy-flag.v2.patch martin.panter, 2017-01-24 05:27 review
Messages (34)
msg244672 - (view) Author: Michael Del Monte (mgdelmonte) Date: 2015-06-02 15:06
Initially reported at https://github.com/kennethreitz/requests/issues/2622

Closely related to http://bugs.python.org/issue19996

An HTTP response with an invalid header line that contains non-blank characters but *no* colon (contrast http://bugs.python.org/issue19996 in which it contained a colon as the first character) causes the same behavior.

httplib.HTTPMessage.readheaders() oddly does not appear even to attempt to follow RFC 2616, which requires the header to terminate with a blank line.  The invalid header line, which admittedly also breaks RFC 2616, is at least non-blank and should not terminate the header.  Yet readheaders() takes it as an indicator that the header is over and then fails properly to process the rest of the response.

The problem is exacerbated by a chunked encoding, which will not be properly received if the encoding header is not seen because readheaders() terminates early.  An example (why are banks always the miscreants here?) is:

p = response.get("http://www.merrickbank.com/")

My recommended fix would be to insert these lines at httplib:327

                # continue reading headers on non-blank lines
                elif not len(line.strip()):
                    continue
                # break only on blank lines


This would cause readheaders() to terminate only on a non-blank non-header non-comment line, in accordance with RFC 2616.
msg244673 - (view) Author: Ian Cordasco (icordasc) * Date: 2015-06-02 15:22
FWIW, the proper section to reference now is 3.2 in RFC 7230 (https://tools.ietf.org/html/rfc7230#section-3.2)
msg244674 - (view) Author: Ian Cordasco (icordasc) * Date: 2015-06-02 15:24
Also I'm marking this as affecting 3.3, 3.4, and 3.5. I haven't tested against 3.5, but it definitely fails on 3.4. I hope to be able to test against 3.5.0b2 tonight
msg244675 - (view) Author: Michael Del Monte (mgdelmonte) Date: 2015-06-02 15:47
Thanks.  Also I meant to have said, "...to terminate only on a *blank* non-header non-comment line, in accordance with RFC 2616 (and 7230)."

I note that the RFCs require CRLF to terminate but in my experience you can get all manner of "blank" lines, so accepting len(line.strip())==0 is going to accommodate servers that give CRCR or LFLF or (and I have seen this) LFCRLF.
msg244676 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-06-02 16:08
The current behavior probably comes out of the RFC822 world, where when seeing a line that doesn't look like a header the error recovery is to treat that line as the beginning of the body (ie: assume the blank line is missing).

Is there in fact any guidance on what the optimal error recovery strategy is for http?  I've actually considered trying to make the error recovery in email smarter, by seeing if the *next* line looks like a header before making the decision.
msg244678 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-06-02 16:28
Ah, in fact that's exactly where it comes from, since httplib uses the email header parsing code.  In python3 we are actually using the email package to parse the headers (which is sensible) (in 2.7 it is a copy of code from the old mimelib with some tweaks).  Fixing this in python3 is best done by making the error recovery enhancement(s) I mentioned in the email package.  

Please note that changing this behavior has the potential to break working code.  That is, just as we have here a server that is out of spec and sending invalid links in the middle of headers, we may have a server that is out of spec by not sending the blank delimiter, which is currently being handled "correctly".  Thus I don't think your simple fix is advisable, instead I think we should pursue the more complicated "look ahead" fix.
msg244703 - (view) Author: Cory Benfield (Lukasa) * Date: 2015-06-02 22:11
This is one of those bugs that's actually super tricky to correctly fix.

The correct path is to have the goal of conservatively accepting as many headers as possible. Probably this means looking ahead to the next few lines and seeing if they appear to roughly keep a header structure (\r\n line breaks and colon separated values). However, I'm not entirely sure how to structure that sentiment in code at this time.
msg244704 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-06-02 22:22
I think there may be a way to accomplish this in a reasonably straightforward fashion in python3 given that feedparser has an 'unreadline' function.  The python2 case is probably going to be a more complicated change.  And I agree that multiple lines should be examined.

There is also the question of what to do with the extra data.  I think the correct approach is to treat a line that does not conform as a line that is missing the leading whitespace that would make it a continuation line.  It looks like that would certainly be correct for the case in hand.
msg244719 - (view) Author: Michael Del Monte (mgdelmonte) Date: 2015-06-03 00:16
I don't want to speak out of school and you guys certainly know what you're doing, but it seems a shame to go through these gyrations -- lookahead plus "unreading" lines -- only to preserve the ability to parse email headers, when HTTP really does follow a different spec.  My suggestion would be to examine the header and decide, if it's HTTP, to just ignore nonconforming lines; and if it's email, then the problem is already solved (as email doesn't have encoding rules that would cause problems later).  

My fear would be that you'll eventually get that nonconforming line with leading whitespace, which will lead right back to the same error.
msg244720 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-06-03 00:48
No, the point is to do "best practical" error recovery when faced with dirty data that may be dirty in various ways, and it doesn't really matter whether it is http headers or email headers.  A line with leading whitespace is treated as part of the preceding header line now, and this is the way it should behave, since the older http standards adopted that behavior from rfc822.  You will note that the standard referenced by Ian is explicit about that, in the obs-fold clause.  That is, we are required by the standard to support that behavior, which is why I posit that the best recovery is to assume an invalid line followed by what look like headers is in fact an incorrectly folded obs-fold continuation line.  That this will also conform to the email standard is a not-accidental consequence of how these standards evolved.  (That is, email and http header handling are *not* "different" specs in the sense of being disjoint, they are derivatives of a common ancestor spec and some effort is spent keeping them interoperable, to my understanding.)
msg244721 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-06-03 00:58
Regarding the suggested fix for Python 2, make sure it does not prematurely end the parsing on empty folded lines (having only tabs and spaces in them). E.g. according to RFC 7230 this should be a single header field:

b"Header: obsolete but\r\n"
b"    \r\n"
b"    still valid\r\n"

I suspect the RFC doesn’t say anything specifically about this case. In general the guidance seems to be things like:

* User agents should be tolerant of errors received in the protocol
* Proxies should fix up errors when forwarding messages upstream
* Servers should often reject errors in requests with 400 Bad Request (presumably to avoid the possibility of a downstream proxy being tricked by the protocol error and not triggering some security filter)

In the case of the bank web site, the last lines of the header are:

X-Frame-Options: SAMEORIGIN\r\n
Set-Cookie: mb-CookieP=; HttpOnly; \r\n
Secure\r\n
Set-Cookie: mb-CookieP=; HttpOnly; Secure\r\n
\r\n

It is obvious that this case could be treated as a folded (continuation) line. But in general I think it would be better to ignore the erroneous line, or to record it as a defect so that the server module or other user can check it.

Looking at the Python 3 code, both the client and server call http.client._parse_headers(), which sensibly reads each line until it sees a blank line (Lib/http/client.py:197). But then after jumping through some more hoops we parse it again until we find a line that fails the regular expression at Lib/email/feedparser.py:37. The remaining lines become the “payload” of the HTTP header:

>>> with urlopen("http://www.merrickbank.com/") as response:
...     response.info().get_payload()
... 
'Secure\r\nSet-Cookie: mb-CookieP=; HttpOnly; Secure\r\n\r\n'

What might be nice is a way to reuse the email header field parsing code, without worrying about the “From” line stuff, or the payload stuff.
msg244722 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-06-03 01:25
Since the email package has the correct logic for handling the blank continuation line case (even in Python2) (because, again, that derives from the original email standard), it might be reasonable to use feedparser's headersonly mode.  If necessary we can introduce a (private) variation that actually stops parsing the stream at the end of the headers.  (This would be a non-public API on feedparser for 2.7 and 3.4/5, and a public one in 3.6.)  Since feedparser was itself a rewrite of the mimetools code httplib is using and theoretically backward compatible with it in behavior, this should not in theory introduce any behavior changes (famous last words?).  

The problem here is the good chance that someone is depending on the internal implementation of HTTPMessage.  So, I'm not at all sure about this suggestion.
 
And yes, the email package in python3 records header defects.  However, I think it is better to treat the no-leading-blank non-header as an incorrectly folded continuation line rather than discarding the data.  The data is in the headers section for *some* reason, and IMO that is the only reasonable guess as to why.
msg244732 - (view) Author: Cory Benfield (Lukasa) * Date: 2015-06-03 08:21
> It is obvious that this case could be treated as a folded (continuation) line. But in general I think it would be better to ignore the erroneous line, or to record it as a defect so that the server module or other user can check it.

Just to clarify, in an instance very similar to this one this would be *terrible* advice. The token that would be lost here is the 'Secure' field on the cookie, which is an extremely important token to have: if we don't correctly parse it, we run the risk of sending the cookie on plaintext connections.

Discarding data is the problem, and while discarding *less* data is an improvement, it would be good if we could resolve this problem in such a way that we'd have correctly parsed this header.

Generally speaking, if we treat these as continuation lines I think we have the best change of making a useful header out of this mess.
msg244733 - (view) Author: Cory Benfield (Lukasa) * Date: 2015-06-03 09:01
While we're here and I'm recommending to drop as little data as possible: we need to be really careful about not exposing ourselves to any kind of data smuggling attack here.

It's really important that we don't let attackers construct bodies of requests or responses that will cause us to misinterpret header blocks. It's therefore going to be really tricky to balance those concerns.
msg244752 - (view) Author: Michael Del Monte (mgdelmonte) Date: 2015-06-03 14:36
Given that obs-fold is technically valid, then can I recommend reading the entire header first (reading to the first blank line) and then tokenizing the individual headers using a regular expression rather than line by line?  That would solve the problem more elegantly and easily than attempting to read lines on the fly and then "unreading" the nonconforming lines.

Here's my recommendation:

    def readheaders(self):
        self.dict = {}
        self.unixfrom = ''
        self.headers = hlist = []
        self.status = ''
        # read entire header (read until first blank line)
        while True:
            line = self.fp.readline(_MAXLINE+1)
            if not line:
                self.status = 'EOF in headers'
                break
            if len(line) > _MAXLINE:
                raise LineTooLong("header line")
            hlist.append(line)
            if line in ('\n', '\r\n'):
                break
            if len(hlist) > _MAXHEADERS:
                raise HTTPException("got more than %d headers" % _MAXHEADERS)
        # reproduce and parse as string
        header = "\n".join(hlist)
        self.headers = re.findall(r"[^ \n][^\n]+\n(?: +[^\n]+\n)*", header)
        firstline = True
        for line in self.headers:
            if firstline and line.startswith('From '):
                self.unixfrom = self.unixfrom + line
                continue
            firstline = False
            if ':' in line:
                k,v = line.split(':',1)
                self.addheader(k, re.sub("\n +", " ", v.strip()))
            else:
                self.status = 'Non-header line where header expected' if self.dict else 'No headers'


I think this handles everything you're trying to do.  I don't understand the unixfrom bit, but I think I have it right.

As for Cory's concern re: smuggling, _MAXLINE and _MAXHEADERS should help with that.  The regexp guarantees that every line plus continuation appears as a single header.

I use re.sub("\n +", " ", v.strip()) to clean the value and remove the continuation.
msg244753 - (view) Author: Michael Del Monte (mgdelmonte) Date: 2015-06-03 14:41
... or perhaps

if ':' in line and line[0] != ':':

to avoid the colon-as-first-char bug that plagued this library earlier, though the only ill-effect of leaving it alone would be a header with a blank key; not the end of the world.
msg255557 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-29 01:24
Since the Python 2 and Python 3 branches are different, two different patches would be needed here. Perhaps they could share common test cases though.

Michael: I presume your proposal is for Python 2. I don’t understand the re.findall() expression; is there a clearer way to do whatever it is trying to do (or failing that, explain it in a comment)? It looks like you are trying to skip over spaces at the start of the first header field name. Also, it seems to drop support for lines folded with tabs rather than spaces.

David: The headers-only mode wouldn’t make much difference because it only affects parsing the “payload”. As far as the email package is concerned, the payload should always be empty when used by HTTP’s parse_headers().

The simplest fix (at least for the Python 3 code) would be to check each line against email.feedparser.headerRE before adding it to the list of lines in the HTTP package. This would prevent the email parser from bailing from the header section early, which is the main problem.

But that does seem like a bad hack and wouldn’t treat the offending line as a folded line, which Cory and David want. So I guess we need to make the email parser more flexible instead. Maybe a private FeedParser._parse_header_lines() method or something, that replaces feed() and close(), and defers most of the processing directly to _parse_headers().
msg268408 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-06-13 00:02
See also Issue 26686; the same problem, but with parsing RFC5322 header fields, rather than HTTP.
msg268544 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-06-14 09:01
I made a patch to fix all header section parsing by default in the email module (see Issue 26686). I’m not 100% sure if it is safe in general, but if it is, it would fix this bug.
msg272537 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-08-12 12:37
In order to avoid messing too much with the intricacies of the existing email parsing, here is a patch for Python 3 that limits the behaviour changes to the HTTP module. It should fix the bad handling of broken header lines. As a side effect, it should also fix Issue 22233, since it bypasses the offending splitlines() call. I incorporated the test cases from my previous patches for Issue 26686 and Issue 22233.

I tried to maintain a minimal compatibility with the previous special behaviour for message/* and multipart/* message types, although I didn’t bother trying to emulate e.g. StartBoundaryNotFoundDefect (http.client.parse_headers() doesn’t pass any body to the email parser, so it won’t see any start boundaries.)
msg275244 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-09-09 03:47
Updated patch for Python 3 now that Issue 22233 has been fixed.
msg275366 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-09-09 18:09
I think you can greatly simplify this code by using BytesFeedParser and feeding the input to it line by line.  You don't need to gather the headers first, since you control how much you read and how much you send to the parser.  The change to email is then the handling of lines without colons.  The simplest fix is to add a policy control for whether a line-that-doesn't-look-like-a-header is considered part of the preceding header or treated as part of the body (or, to put it another way, do we really only believe the body starts after a blank line or not).  You can make that a setting on Policy, and derive the policy used here in http from Compat32, so as to otherwise minimize the behavior changes.  If we can't get this done by the beta cutoff on Monday, we can make the new policy setting start with an '_' in 3.6, since the point here is the bug fix, not the feature; and then make it public in 3.7 (since I think it is useful for more than just http).
msg275368 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-09-09 18:11
I'm not sure it is a good idea to backport this to 2.7, but if you want to, the same fix can be made in a more hackish way on 2.7 by putting a private variable on FeedParser to control the header parsing behavior.
msg275370 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-09-09 18:14
Hmm.  Or maybe the latin-1 decode and FeedParser is better, since with BytesFeedParser and non-ascii you get Header objects, which you don't want.  Either way, there's no TextIOWrapper involved.
msg275473 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-09-09 22:23
I guess we could add this secret policy flag that the email parser checks. The solution should still be applied as a bug fix to 3.5 as well as 3.6+. I would have to make the flag “very” unique, to reduce the chance of it breaking user code. I.e. adding policy._strict_end_of_headers might interfere with somebody’s user-defined policy class, but maybe policy.__strict_end_of_headers__ is safe, since user code is not supposed to invent such attribute names. Or do you think it “_strict_end_of_headers” name is fair game since the policy stuff is/was provisional?

It would be good to fix 2.7. This bug, and the duplicate report, were both originally reported for 2.7. I haven’t looked at the situation there in detail, but it looks completely different. HTTPMessage is based on the “mimetools” package, and there is a lot more parsing code already in “httplib”.
msg275484 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-09-09 22:58
Oh, yes, I forgot 2.7 was using the older code.

Sure, that flag name sounds fine.  I'm not worrying about policy flag name collisions, but perhaps I should be.  How about _py_strict_end_of_headers?  I don't really care what it is named for the bug fix, I'll think about the exposed name for 3.7.
msg275582 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-09-10 06:49
Here is a fix using a policy flag. I called it “_py_body_detached”.
msg275607 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-09-10 09:38
Patch for Python 2
msg276670 - (view) Author: Roundup Robot (python-dev) Date: 2016-09-16 03:06
New changeset c38e10ad7c89 by Martin Panter in branch '2.7':
Issue #24363: Continue parsing HTTP header in spite of invalid lines
https://hg.python.org/cpython/rev/c38e10ad7c89
msg276672 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-09-16 03:34
I pushed my Py 2 patch, since it is simpler and does not interfere with other modules. But it would still be good to get feedback on policy-flag.patch for Python 3.
msg276896 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-09-18 15:20
I will try to review this in the not too distant future.  You can ping me if I don't get to it by next Saturday.

I think I'll probably prefer to call the flag something like _greedy_header_parsing, to reflect a change from assuming we've got body on a non-header-like line to assuming we've still got header if we haven't seen a blank line yet.  What we call the private version of the flag doesn't matter all that much, though.
msg286138 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-01-24 05:27
Just a minor update with an extra get_payload() test I missed before
msg287228 - (view) Author: Guillaume Boudreau (gboudreau) Date: 2017-02-07 11:35
Any chance this could get reviewed and merged soon? I got hit by a similar issue (see #29445) where the server, which I don't control, sends me invalid HTTP headers, and the prevents all the headers that follow it to not be parsed.
The latest attached patch fixed the issue for me.
Thanks.
msg287237 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-02-07 15:12
Yeah, I'm going to try to get to this this weekend.
History
Date User Action Args
2017-02-07 15:12:16r.david.murraysetmessages: + msg287237
2017-02-07 11:35:35gboudreausetnosy: + gboudreau
messages: + msg287228
2017-02-04 20:58:55r.david.murraylinkissue29445 superseder
2017-01-24 05:27:42martin.pantersetfiles: + policy-flag.v2.patch

messages: + msg286138
versions: + Python 3.7
2016-09-18 15:20:24r.david.murraysetmessages: + msg276896
2016-09-16 03:34:55martin.pantersetmessages: + msg276672
2016-09-16 03:06:29python-devsetnosy: + python-dev
messages: + msg276670
2016-09-10 09:38:32martin.pantersetfiles: + header.py2.patch

messages: + msg275607
2016-09-10 06:49:51martin.pantersetfiles: + policy-flag.patch

messages: + msg275582
2016-09-09 22:58:31r.david.murraysetmessages: + msg275484
2016-09-09 22:23:28martin.pantersetmessages: + msg275473
2016-09-09 18:14:24r.david.murraysetmessages: + msg275370
2016-09-09 18:11:58r.david.murraysetmessages: + msg275368
2016-09-09 18:10:29orsenthilsetnosy: + orsenthil
2016-09-09 18:09:19r.david.murraysetmessages: + msg275366
2016-09-09 03:47:43martin.pantersetfiles: + bypass-parsegen.v2.patch

messages: + msg275244
2016-08-12 12:37:54martin.pantersetfiles: + bypass-parsegen.patch
messages: + msg272537

dependencies: - email.parser stops parsing headers too soon when given a defective message.
keywords: + patch
stage: needs patch -> patch review
2016-06-14 09:01:07martin.pantersetdependencies: + email.parser stops parsing headers too soon when given a defective message.
messages: + msg268544
2016-06-13 00:02:35martin.pantersetmessages: + msg268408
2016-06-11 15:08:06martin.pantersetversions: + Python 3.6, - Python 3.4
2016-06-11 15:04:09martin.panterlinkissue27296 superseder
2015-11-29 01:24:17martin.pantersetmessages: + msg255557
2015-06-03 14:41:46mgdelmontesetmessages: + msg244753
2015-06-03 14:36:26mgdelmontesetmessages: + msg244752
2015-06-03 14:04:49piotr.dobrogostsetnosy: + piotr.dobrogost
2015-06-03 09:01:42Lukasasetmessages: + msg244733
2015-06-03 08:21:34Lukasasetmessages: + msg244732
2015-06-03 01:25:09r.david.murraysetmessages: + msg244722
2015-06-03 00:58:05martin.pantersetnosy: + martin.panter
messages: + msg244721
2015-06-03 00:48:35r.david.murraysetmessages: + msg244720
2015-06-03 00:16:27mgdelmontesetmessages: + msg244719
2015-06-02 22:29:40demian.brechtsetnosy: + demian.brecht
2015-06-02 22:22:15r.david.murraysetmessages: + msg244704
2015-06-02 22:11:53Lukasasetnosy: + Lukasa
messages: + msg244703
2015-06-02 16:28:55r.david.murraysetkeywords: + easy
nosy: + barry

components: + email
stage: needs patch
2015-06-02 16:28:01r.david.murraysetmessages: + msg244678
versions: - Python 3.3
2015-06-02 16:08:58r.david.murraysetnosy: + r.david.murray
messages: + msg244676
2015-06-02 15:47:17mgdelmontesetmessages: + msg244675
2015-06-02 15:24:44icordascsetmessages: + msg244674
versions: + Python 3.3, Python 3.4, Python 3.5
2015-06-02 15:22:40icordascsetnosy: + icordasc
messages: + msg244673
2015-06-02 15:06:03mgdelmontecreate