
httplib fails to handle semivalid HTTP headers #68551

Open
mgdelmonte mannequin opened this issue Jun 2, 2015 · 35 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error

Comments

mgdelmonte mannequin commented Jun 2, 2015

BPO 24363
Nosy @warsaw, @orsenthil, @bitdancer, @vadmium, @sigmavirus24, @demianbrecht, @maxking, @Lukasa, @gboudreau
PRs
  • bpo-36226: Fix multipart false positive header defects #12214
Files
  • bypass-parsegen.patch: Python 3
  • bypass-parsegen.v2.patch
  • policy-flag.patch
  • header.py2.patch
  • policy-flag.v2.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2015-06-02.15:06:03.770>
    labels = ['3.7', '3.8', 'type-bug', 'library', 'expert-email']
    title = 'httplib fails to handle semivalid HTTP headers'
    updated_at = <Date 2019-09-16.10:07:45.075>
    user = 'https://bugs.python.org/mgdelmonte'

    bugs.python.org fields:

    activity = <Date 2019-09-16.10:07:45.075>
    actor = 'cschmidbauer'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)', 'email']
    creation = <Date 2015-06-02.15:06:03.770>
    creator = 'mgdelmonte'
    dependencies = []
    files = ['44087', '44482', '44524', '44527', '46400']
    hgrepos = []
    issue_num = 24363
    keywords = ['patch']
    message_count = 35.0
    messages = ['244672', '244673', '244674', '244675', '244676', '244678', '244703', '244704', '244719', '244720', '244721', '244722', '244732', '244733', '244752', '244753', '255557', '268408', '268544', '272537', '275244', '275366', '275368', '275370', '275473', '275484', '275582', '275607', '276670', '276672', '276896', '286138', '287228', '287237', '352448']
    nosy_count = 12.0
    nosy_names = ['barry', 'orsenthil', 'r.david.murray', 'python-dev', 'martin.panter', 'piotr.dobrogost', 'icordasc', 'demian.brecht', 'maxking', 'Lukasa', 'mgdelmonte', 'gboudreau']
    pr_nums = ['12214']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue24363'
    versions = ['Python 3.7', 'Python 3.8']

    mgdelmonte mannequin commented Jun 2, 2015

    Initially reported at https://github.com/kennethreitz/requests/issues/2622

    Closely related to http://bugs.python.org/issue19996

    An HTTP response with an invalid header line that contains non-blank characters but *no* colon (contrast http://bugs.python.org/issue19996 in which it contained a colon as the first character) causes the same behavior.

    httplib.HTTPMessage.readheaders() oddly does not even appear to attempt to follow RFC 2616, which requires the header section to terminate with a blank line. The invalid header line, which admittedly also violates RFC 2616, is at least non-blank and should not terminate the headers. Yet readheaders() takes it as an indicator that the headers are over, and then fails to process the rest of the response properly.

    The problem is exacerbated by chunked encoding, which will not be properly decoded if the Transfer-Encoding header is never seen because readheaders() terminated early. An example (why are banks always the miscreants here?) is:

    p = requests.get("http://www.merrickbank.com/")

    My recommended fix would be to insert these lines at httplib:327

                # continue reading headers on non-blank lines
                elif not len(line.strip()):
                    continue
                # break only on blank lines
    

    This would cause readheaders() to terminate only on a non-blank non-header non-comment line, in accordance with RFC 2616.
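    The failure mode can be reproduced in Python 3 with the email parser that http.client delegates to; this is a minimal sketch, with header bytes invented to mirror the situation described above:

```python
from email.parser import Parser

# A header section containing a non-blank line with no colon and no
# leading whitespace, modeled on the kind of response described above.
raw = (
    "X-Frame-Options: SAMEORIGIN\r\n"
    "Secure\r\n"                            # invalid: not a header, not blank
    "Set-Cookie: mb-CookieP=; HttpOnly\r\n"
    "\r\n"
)

msg = Parser().parsestr(raw)
# The parser stops at the invalid line and demotes everything after it
# to the message body ("payload"), so Set-Cookie is lost as a header.
print(msg.keys())               # only headers before the bad line survive
print(repr(msg.get_payload()))
```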

    @mgdelmonte mgdelmonte mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jun 2, 2015
    sigmavirus24 mannequin commented Jun 2, 2015

    FWIW, the proper section to reference now is 3.2 in RFC 7230 (https://tools.ietf.org/html/rfc7230#section-3.2)

    sigmavirus24 mannequin commented Jun 2, 2015

    Also I'm marking this as affecting 3.3, 3.4, and 3.5. I haven't tested against 3.5, but it definitely fails on 3.4. I hope to be able to test against 3.5.0b2 tonight

    mgdelmonte mannequin commented Jun 2, 2015

    Thanks. Also I meant to have said, "...to terminate only on a *blank* non-header non-comment line, in accordance with RFC 2616 (and 7230)."

    I note that the RFCs require CRLF to terminate but in my experience you can get all manner of "blank" lines, so accepting len(line.strip())==0 is going to accommodate servers that give CRCR or LFLF or (and I have seen this) LFCRLF.
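    The permissive check suggested here is easy to sketch; `is_blank` is an illustrative helper, not code from httplib:

```python
# End-of-headers test that tolerates CRCR, LFLF, LFCRLF and friends:
# anything that strips to empty counts as a blank line.
def is_blank(line):
    return not line.strip()

for terminator in ("\r\n", "\n", "\r\r", "\n\n", "\n\r\n"):
    assert is_blank(terminator)

for header in ("Content-Type: text/html\r\n", "\tfolded continuation\r\n"):
    assert not is_blank(header)

# Caveat: an obs-fold continuation that is all whitespace (" \r\n")
# also strips to empty, so this check alone would end the header
# section one line too early in that (technically legal) case.
assert is_blank(" \r\n")
```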

    bitdancer (member) commented:

    The current behavior probably comes out of the RFC 822 world, where the error recovery on seeing a line that doesn't look like a header is to treat that line as the beginning of the body (i.e., assume the blank line is missing).

    Is there in fact any guidance on what the optimal error recovery strategy is for http? I've actually considered trying to make the error recovery in email smarter, by seeing if the *next* line looks like a header before making the decision.

    bitdancer (member) commented:

    Ah, in fact that's exactly where it comes from, since httplib uses the email header-parsing code. In Python 3 we are actually using the email package to parse the headers (which is sensible); in 2.7 it is a copy of code from the old mimelib with some tweaks. Fixing this in Python 3 is best done by making the error-recovery enhancement(s) I mentioned in the email package.

    Please note that changing this behavior has the potential to break working code. That is, just as we have here a server that is out of spec and sending invalid lines in the middle of headers, we may have a server that is out of spec by not sending the blank delimiter, which is currently being handled "correctly". Thus I don't think your simple fix is advisable; instead I think we should pursue the more complicated "look ahead" fix.

    Lukasa mannequin commented Jun 2, 2015

    This is one of those bugs that's actually super tricky to correctly fix.

    The correct path is to aim to conservatively accept as many headers as possible. Probably this means looking ahead at the next few lines and seeing whether they appear to roughly keep a header structure (\r\n line breaks and colon-separated values). However, I'm not entirely sure how to express that in code at this time.

    bitdancer (member) commented:

    I think there may be a way to accomplish this in a reasonably straightforward fashion in python3 given that feedparser has an 'unreadline' function. The python2 case is probably going to be a more complicated change. And I agree that multiple lines should be examined.

    There is also the question of what to do with the extra data. I think the correct approach is to treat a line that does not conform as a line that is missing the leading whitespace that would make it a continuation line. It looks like that would certainly be correct for the case in hand.
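    The look-ahead recovery described in the last two comments might be sketched roughly like this; the function and regex are invented for illustration and are not the actual email/http code:

```python
import re

# Rough shape of a header field line: token characters, then a colon
# (the character class is printable ASCII minus ':' and the space).
HEADER_RE = re.compile(r"^[!-9;-~]+:")

def recover_headers(lines):
    """Keep a malformed line as an obs-fold-style continuation of the
    previous field when the *next* line still looks like a header (or
    ends the section); otherwise it could be recorded as a defect."""
    fields = []
    for i, line in enumerate(lines):
        if not line.strip():                        # blank line ends headers
            break
        if HEADER_RE.match(line) or line[:1] in (" ", "\t"):
            fields.append(line)
        elif fields and (i + 1 == len(lines)
                         or HEADER_RE.match(lines[i + 1])
                         or not lines[i + 1].strip()):
            # Treat it as a continuation missing its leading whitespace.
            fields[-1] = fields[-1].rstrip("\r\n") + " " + line
        # else: genuinely unrecognizable; drop or record a defect
    return fields

hdrs = recover_headers([
    "Set-Cookie: mb-CookieP=; HttpOnly; \r\n",
    "Secure\r\n",                                   # lost continuation
    "X-Frame-Options: SAMEORIGIN\r\n",
    "\r\n",
])
# hdrs[0] now carries the "Secure" attribute instead of losing it.
```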

    mgdelmonte mannequin commented Jun 3, 2015

    I don't want to speak out of school and you guys certainly know what you're doing, but it seems a shame to go through these gyrations -- lookahead plus "unreading" lines -- only to preserve the ability to parse email headers, when HTTP really does follow a different spec. My suggestion would be to examine the header and decide, if it's HTTP, to just ignore nonconforming lines; and if it's email, then the problem is already solved (as email doesn't have encoding rules that would cause problems later).

    My fear would be that you'll eventually get that nonconforming line with leading whitespace, which will lead right back to the same error.

    bitdancer (member) commented:

    No, the point is to do "best practical" error recovery when faced with dirty data that may be dirty in various ways, and it doesn't really matter whether it is http headers or email headers. A line with leading whitespace is treated as part of the preceding header line now, and this is the way it should behave, since the older http standards adopted that behavior from rfc822. You will note that the standard referenced by Ian is explicit about that, in the obs-fold clause.

    That is, we are required by the standard to support that behavior, which is why I posit that the best recovery is to assume an invalid line followed by what look like headers is in fact an incorrectly folded obs-fold continuation line. That this will also conform to the email standard is a not-accidental consequence of how these standards evolved. (That is, email and http header handling are *not* "different" specs in the sense of being disjoint; they are derivatives of a common ancestor spec, and some effort is spent keeping them interoperable, to my understanding.)

    vadmium commented Jun 3, 2015

    Regarding the suggested fix for Python 2, make sure it does not prematurely end the parsing on empty folded lines (having only tabs and spaces in them). E.g. according to RFC 7230 this should be a single header field:

    b"Header: obsolete but\r\n"
    b" \r\n"
    b" still valid\r\n"
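    For reference, unfolding such an obs-fold sequence (RFC 7230 says recipients must replace each CRLF-plus-whitespace with one or more spaces) is a one-liner:

```python
import re

raw = ("Header: obsolete but\r\n"
       " \r\n"
       " still valid\r\n")

# Replace each obs-fold (CRLF followed by SP/HTAB) with a single space.
unfolded = re.sub(r"\r\n[ \t]+", " ", raw)
print(repr(unfolded))  # -> 'Header: obsolete but  still valid\r\n'
```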

    I suspect the RFC doesn’t say anything specifically about this case. In general the guidance seems to be things like:

    • User agents should be tolerant of errors received in the protocol
    • Proxies should fix up errors when forwarding messages upstream
    • Servers should often reject errors in requests with 400 Bad Request (presumably to avoid the possibility of a downstream proxy being tricked by the protocol error and not triggering some security filter)

    In the case of the bank web site, the last lines of the header are:

    X-Frame-Options: SAMEORIGIN\r\n
    Set-Cookie: mb-CookieP=; HttpOnly; \r\n
    Secure\r\n
    Set-Cookie: mb-CookieP=; HttpOnly; Secure\r\n
    \r\n

    It is obvious that this case could be treated as a folded (continuation) line. But in general I think it would be better to ignore the erroneous line, or to record it as a defect so that the server module or other user can check it.

    Looking at the Python 3 code, both the client and server call http.client._parse_headers(), which sensibly reads each line until it sees a blank line (Lib/http/client.py:197). But then after jumping through some more hoops we parse it again until we find a line that fails the regular expression at Lib/email/feedparser.py:37. The remaining lines become the “payload” of the HTTP header:

    >>> with urlopen("http://www.merrickbank.com/") as response:
    ...     response.info().get_payload()
    ... 
    'Secure\r\nSet-Cookie: mb-CookieP=; HttpOnly; Secure\r\n\r\n'

    What might be nice is a way to reuse the email header field parsing code, without worrying about the “From” line stuff, or the payload stuff.

    bitdancer (member) commented:

    Since the email package has the correct logic for handling the blank continuation line case (even in Python2) (because, again, that derives from the original email standard), it might be reasonable to use feedparser's headersonly mode. If necessary we can introduce a (private) variation that actually stops parsing the stream at the end of the headers. (This would be a non-public API on feedparser for 2.7 and 3.4/5, and a public one in 3.6.) Since feedparser was itself a rewrite of the mimetools code httplib is using and theoretically backward compatible with it in behavior, this should not in theory introduce any behavior changes (famous last words?).

    The problem here is the good chance that someone is depending on the internal implementation of HTTPMessage. So, I'm not at all sure about this suggestion.

    And yes, the email package in python3 records header defects. However, I think it is better to treat the no-leading-blank non-header as an incorrectly folded continuation line rather than discarding the data. The data is in the headers section for *some* reason, and IMO that is the only reasonable guess as to why.

    Lukasa mannequin commented Jun 3, 2015

    It is obvious that this case could be treated as a folded (continuation) line. But in general I think it would be better to ignore the erroneous line, or to record it as a defect so that the server module or other user can check it.

    Just to clarify, in an instance very similar to this one this would be *terrible* advice. The token that would be lost here is the 'Secure' field on the cookie, which is an extremely important token to have: if we don't correctly parse it, we run the risk of sending the cookie on plaintext connections.

    Discarding data is the problem, and while discarding *less* data is an improvement, it would be good if we could resolve this problem in such a way that we'd have correctly parsed this header.

    Generally speaking, if we treat these as continuation lines I think we have the best chance of making a useful header out of this mess.

    Lukasa mannequin commented Jun 3, 2015

    While we're here and I'm recommending to drop as little data as possible: we need to be really careful about not exposing ourselves to any kind of data smuggling attack here.

    It's really important that we don't let attackers construct bodies of requests or responses that will cause us to misinterpret header blocks. It's therefore going to be really tricky to balance those concerns.

    mgdelmonte mannequin commented Jun 3, 2015

    Given that obs-fold is technically valid, then can I recommend reading the entire header first (reading to the first blank line) and then tokenizing the individual headers using a regular expression rather than line by line? That would solve the problem more elegantly and easily than attempting to read lines on the fly and then "unreading" the nonconforming lines.

    Here's my recommendation:

        def readheaders(self):
            self.dict = {}
            self.unixfrom = ''
            self.headers = hlist = []
            self.status = ''
            # read entire header (read until first blank line)
            while True:
                line = self.fp.readline(_MAXLINE+1)
                if not line:
                    self.status = 'EOF in headers'
                    break
                if len(line) > _MAXLINE:
                    raise LineTooLong("header line")
                hlist.append(line)
                if line in ('\n', '\r\n'):
                    break
                if len(hlist) > _MAXHEADERS:
                    raise HTTPException("got more than %d headers" % _MAXHEADERS)
            # reproduce and parse as string
            header = "\n".join(hlist)
            self.headers = re.findall(r"[^ \n][^\n]+\n(?: +[^\n]+\n)*", header)
            firstline = True
            for line in self.headers:
                if firstline and line.startswith('From '):
                    self.unixfrom = self.unixfrom + line
                    continue
                firstline = False
                if ':' in line:
                    k,v = line.split(':',1)
                    self.addheader(k, re.sub("\n +", " ", v.strip()))
                else:
                    self.status = 'Non-header line where header expected' if self.dict else 'No headers'

    I think this handles everything you're trying to do. I don't understand the unixfrom bit, but I think I have it right.

    As for Cory's concern re: smuggling, _MAXLINE and _MAXHEADERS should help with that. The regexp guarantees that every line plus continuation appears as a single header.

    I use re.sub("\n +", " ", v.strip()) to clean the value and remove the continuation.

    mgdelmonte mannequin commented Jun 3, 2015

    ... or perhaps

    if ':' in line and line[0] != ':':

    to avoid the colon-as-first-char bug that plagued this library earlier, though the only ill-effect of leaving it alone would be a header with a blank key; not the end of the world.

    vadmium commented Nov 29, 2015

    Since the Python 2 and Python 3 branches are different, two different patches would be needed here. Perhaps they could share common test cases though.

    Michael: I presume your proposal is for Python 2. I don’t understand the re.findall() expression; is there a clearer way to do whatever it is trying to do (or failing that, explain it in a comment)? It looks like you are trying to skip over spaces at the start of the first header field name. Also, it seems to drop support for lines folded with tabs rather than spaces.

    David: The headers-only mode wouldn’t make much difference because it only affects parsing the “payload”. As far as the email package is concerned, the payload should always be empty when used by HTTP’s parse_headers().

    The simplest fix (at least for the Python 3 code) would be to check each line against email.feedparser.headerRE before adding it to the list of lines in the HTTP package. This would prevent the email parser from bailing from the header section early, which is the main problem.

    But that does seem like a bad hack and wouldn’t treat the offending line as a folded line, which Cory and David want. So I guess we need to make the email parser more flexible instead. Maybe a private FeedParser._parse_header_lines() method or something, that replaces feed() and close(), and defers most of the processing directly to _parse_headers().
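    The "simplest fix" mentioned above could be sketched as a pre-filter; `read_headers` is a hypothetical stand-in for the reading loop in http.client, and note that (as conceded above) it discards the offending line rather than folding it:

```python
import io
from email import feedparser  # headerRE is a private detail of the email package

def read_headers(fp, max_line=65536, max_headers=100):
    """Collect raw header lines, skipping any line that the email
    parser's headerRE would reject, so that parser can no longer bail
    out of the header section early."""
    lines = []
    while True:
        line = fp.readline(max_line + 1)
        if line in (b"\r\n", b"\n", b""):
            break
        if feedparser.headerRE.match(line.decode("iso-8859-1")):
            lines.append(line)
        # else: malformed line; silently dropped (the "bad hack" part)
        if len(lines) > max_headers:
            raise ValueError("too many header lines")
    return lines

fp = io.BytesIO(b"Set-Cookie: a=b; \r\n"
                b"Secure\r\n"
                b"X-Frame-Options: SAMEORIGIN\r\n"
                b"\r\n")
print(read_headers(fp))  # the bare "Secure" line is gone
```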

    vadmium commented Jun 13, 2016

    See also bpo-26686; the same problem, but with parsing RFC5322 header fields, rather than HTTP.

    vadmium commented Jun 14, 2016

    I made a patch to fix all header section parsing by default in the email module (see bpo-26686). I’m not 100% sure if it is safe in general, but if it is, it would fix this bug.

    vadmium commented Aug 12, 2016

    In order to avoid messing too much with the intricacies of the existing email parsing, here is a patch for Python 3 that limits the behaviour changes to the HTTP module. It should fix the bad handling of broken header lines. As a side effect, it should also fix bpo-22233, since it bypasses the offending splitlines() call. I incorporated the test cases from my previous patches for bpo-26686 and bpo-22233.

    I tried to maintain a minimal compatibility with the previous special behaviour for message/* and multipart/* message types, although I didn’t bother trying to emulate e.g. StartBoundaryNotFoundDefect (http.client.parse_headers() doesn’t pass any body to the email parser, so it won’t see any start boundaries.)

    vadmium commented Sep 9, 2016

    Updated patch for Python 3 now that bpo-22233 has been fixed.

    bitdancer (member) commented:

    I think you can greatly simplify this code by using BytesFeedParser and feeding the input to it line by line. You don't need to gather the headers first, since you control how much you read and how much you send to the parser.

    The change to email is then the handling of lines without colons. The simplest fix is to add a policy control for whether a line-that-doesn't-look-like-a-header is considered part of the preceding header or treated as part of the body (or, to put it another way, do we really only believe the body starts after a blank line or not). You can make that a setting on Policy, and derive the policy used here in http from Compat32, so as to otherwise minimize the behavior changes.

    If we can't get this done by the beta cutoff on Monday, we can make the new policy setting start with an '_' in 3.6, since the point here is the bug fix, not the feature; and then make it public in 3.7 (since I think it is useful for more than just http).
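    A minimal sketch of such a policy knob; the class and flag names here are hypothetical, and deriving from Compat32 is what keeps every other behaviour unchanged:

```python
from email.policy import Compat32

class GreedyHeaderPolicy(Compat32):
    """Hypothetical policy variant: a flag the feed parser could consult
    to decide whether a non-header-like line seen before the blank
    separator belongs to the preceding header rather than to the body."""
    _greedy_header_parsing = True

# http.client would hand an instance of this to the email parser it uses.
policy = GreedyHeaderPolicy()
```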

    bitdancer (member) commented:

    I'm not sure it is a good idea to backport this to 2.7, but if you want to, the same fix can be made in a more hackish way on 2.7 by putting a private variable on FeedParser to control the header parsing behavior.

    bitdancer (member) commented:

    Hmm. Or maybe the latin-1 decode and FeedParser is better, since with BytesFeedParser and non-ascii you get Header objects, which you don't want. Either way, there's no TextIOWrapper involved.

    vadmium commented Sep 9, 2016

    I guess we could add this secret policy flag that the email parser checks. The solution should still be applied as a bug fix to 3.5 as well as 3.6+. I would have to make the flag “very” unique, to reduce the chance of it breaking user code. I.e. adding policy._strict_end_of_headers might interfere with somebody’s user-defined policy class, but maybe policy.__strict_end_of_headers__ is safe, since user code is not supposed to invent such attribute names. Or do you think the “_strict_end_of_headers” name is fair game, since the policy stuff is/was provisional?

    It would be good to fix 2.7. This bug, and the duplicate report, were both originally reported for 2.7. I haven’t looked at the situation there in detail, but it looks completely different. HTTPMessage is based on the “mimetools” package, and there is a lot more parsing code already in “httplib”.

    bitdancer (member) commented:

    Oh, yes, I forgot 2.7 was using the older code.

    Sure, that flag name sounds fine. I'm not worrying about policy flag name collisions, but perhaps I should be. How about _py_strict_end_of_headers? I don't really care what it is named for the bug fix, I'll think about the exposed name for 3.7.

    vadmium commented Sep 10, 2016

    Here is a fix using a policy flag. I called it “_py_body_detached”.

    vadmium commented Sep 10, 2016

    Patch for Python 2

    python-dev mannequin commented Sep 16, 2016

    New changeset c38e10ad7c89 by Martin Panter in branch '2.7':
    Issue bpo-24363: Continue parsing HTTP header in spite of invalid lines
    https://hg.python.org/cpython/rev/c38e10ad7c89

    vadmium commented Sep 16, 2016

    I pushed my Py 2 patch, since it is simpler and does not interfere with other modules. But it would still be good to get feedback on policy-flag.patch for Python 3.

    bitdancer (member) commented:

    I will try to review this in the not too distant future. You can ping me if I don't get to it by next Saturday.

    I think I'll probably prefer to call the flag something like _greedy_header_parsing, to reflect a change from assuming we've got body on a non-header-like line to assuming we've still got header if we haven't seen a blank line yet. What we call the private version of the flag doesn't matter all that much, though.

    vadmium commented Jan 24, 2017

    Just a minor update with an extra get_payload() test I missed before

    @vadmium vadmium added the 3.7 (EOL) end of life label Jan 24, 2017
    gboudreau mannequin commented Feb 7, 2017

    Any chance this could get reviewed and merged soon? I got hit by a similar issue (see bpo-29445) where the server, which I don't control, sends an invalid HTTP header that prevents all the headers following it from being parsed.
    The latest attached patch fixed the issue for me.
    Thanks.

    bitdancer (member) commented:

    Yeah, I'm going to try to get to this this weekend.

    maxking commented Sep 14, 2019

    Martin: Can you please create a PR for the added patch? If you are busy, I can do that for you, just wanted to ask before I do :)

    I am going to remove the "easy" label from this issue, which IMO it clearly isn't, given the 4 years of history to catch up on and a few other related bugs.

    Since some time has passed since the patch was posted, I am going to target this at 3.8 and 3.7, since the other versions are in security-fix-only mode.

    @maxking maxking added 3.8 only security fixes and removed easy labels Sep 14, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022