
httplib fails to handle semivalid HTTP headers #68551

Open
mgdelmonte mannequin opened this issue Jun 2, 2015 · 35 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error

Comments

mgdelmonte mannequin commented Jun 2, 2015

BPO 24363
Nosy @warsaw, @orsenthil, @bitdancer, @vadmium, @sigmavirus24, @demianbrecht, @maxking, @Lukasa, @gboudreau
PRs
  • bpo-36226: Fix multipart false positive header defects #12214
Files
  • bypass-parsegen.patch: Python 3
  • bypass-parsegen.v2.patch
  • policy-flag.patch
  • header.py2.patch
  • policy-flag.v2.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2015-06-02.15:06:03.770>
    labels = ['3.7', '3.8', 'type-bug', 'library', 'expert-email']
    title = 'httplib fails to handle semivalid HTTP headers'
    updated_at = <Date 2019-09-16.10:07:45.075>
    user = 'https://bugs.python.org/mgdelmonte'

    bugs.python.org fields:

    activity = <Date 2019-09-16.10:07:45.075>
    actor = 'cschmidbauer'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)', 'email']
    creation = <Date 2015-06-02.15:06:03.770>
    creator = 'mgdelmonte'
    dependencies = []
    files = ['44087', '44482', '44524', '44527', '46400']
    hgrepos = []
    issue_num = 24363
    keywords = ['patch']
    message_count = 35.0
    messages = ['244672', '244673', '244674', '244675', '244676', '244678', '244703', '244704', '244719', '244720', '244721', '244722', '244732', '244733', '244752', '244753', '255557', '268408', '268544', '272537', '275244', '275366', '275368', '275370', '275473', '275484', '275582', '275607', '276670', '276672', '276896', '286138', '287228', '287237', '352448']
    nosy_count = 12.0
    nosy_names = ['barry', 'orsenthil', 'r.david.murray', 'python-dev', 'martin.panter', 'piotr.dobrogost', 'icordasc', 'demian.brecht', 'maxking', 'Lukasa', 'mgdelmonte', 'gboudreau']
    pr_nums = ['12214']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue24363'
    versions = ['Python 3.7', 'Python 3.8']

    mgdelmonte mannequin commented Jun 2, 2015

    Initially reported at https://github.com/kennethreitz/requests/issues/2622

    Closely related to http://bugs.python.org/issue19996

    An HTTP response with an invalid header line that contains non-blank characters but *no* colon (contrast http://bugs.python.org/issue19996 in which it contained a colon as the first character) causes the same behavior.

    httplib.HTTPMessage.readheaders() oddly does not even appear to attempt to follow RFC 2616, which requires the header section to terminate with a blank line. The invalid header line, which admittedly also violates RFC 2616, is at least non-blank and should not terminate the headers. Yet readheaders() takes it as an indicator that the headers are over, and then fails to process the rest of the response properly.

    The problem is exacerbated by chunked encoding, which will not be properly decoded if the Transfer-Encoding header is never seen because readheaders() terminated early. An example (why are banks always the miscreants here?) is:

    p = requests.get("http://www.merrickbank.com/")

    My recommended fix would be to insert these lines at httplib:327

                # continue reading headers on non-blank lines
                elif not len(line.strip()):
                    continue
                # break only on blank lines
    

    This would cause readheaders() to terminate only on a non-blank non-header non-comment line, in accordance with RFC 2616.
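    The failure mode can be reproduced in Python 3 with the email parser that http.client delegates to; this is a minimal sketch, with header bytes invented to mirror the situation described above:

```python
from email.parser import Parser

# A header section containing a non-blank line with no colon and no
# leading whitespace, modeled on the kind of response described above.
raw = (
    "X-Frame-Options: SAMEORIGIN\r\n"
    "Secure\r\n"                            # invalid: not a header, not blank
    "Set-Cookie: mb-CookieP=; HttpOnly\r\n"
    "\r\n"
)

msg = Parser().parsestr(raw)
# The parser stops at the invalid line and demotes everything after it
# to the message body ("payload"), so Set-Cookie is lost as a header.
print(msg.keys())               # only headers before the bad line survive
print(repr(msg.get_payload()))
```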

    @mgdelmonte mgdelmonte mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jun 2, 2015
    sigmavirus24 mannequin commented Jun 2, 2015

    FWIW, the proper section to reference now is 3.2 in RFC 7230 (https://tools.ietf.org/html/rfc7230#section-3.2)

    sigmavirus24 mannequin commented Jun 2, 2015

    Also I'm marking this as affecting 3.3, 3.4, and 3.5. I haven't tested against 3.5, but it definitely fails on 3.4. I hope to be able to test against 3.5.0b2 tonight

    mgdelmonte mannequin commented Jun 2, 2015

    Thanks. Also I meant to have said, "...to terminate only on a *blank* non-header non-comment line, in accordance with RFC 2616 (and 7230)."

    I note that the RFCs require CRLF to terminate but in my experience you can get all manner of "blank" lines, so accepting len(line.strip())==0 is going to accommodate servers that give CRCR or LFLF or (and I have seen this) LFCRLF.
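    The permissive check suggested here is easy to sketch; `is_blank` is an illustrative helper, not code from httplib:

```python
# End-of-headers test that tolerates CRCR, LFLF, LFCRLF and friends:
# anything that strips to empty counts as a blank line.
def is_blank(line):
    return not line.strip()

for terminator in ("\r\n", "\n", "\r\r", "\n\n", "\n\r\n"):
    assert is_blank(terminator)

for header in ("Content-Type: text/html\r\n", "\tfolded continuation\r\n"):
    assert not is_blank(header)

# Caveat: an obs-fold continuation that is all whitespace (" \r\n")
# also strips to empty, so this check alone would end the header
# section one line too early in that (technically legal) case.
assert is_blank(" \r\n")
```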

    bitdancer (member) commented:

    The current behavior probably comes out of the RFC 822 world, where the error recovery on seeing a line that doesn't look like a header is to treat that line as the beginning of the body (i.e., assume the blank line is missing).

    Is there in fact any guidance on what the optimal error recovery strategy is for http? I've actually considered trying to make the error recovery in email smarter, by seeing if the *next* line looks like a header before making the decision.

    bitdancer (member) commented:

    Ah, in fact that's exactly where it comes from, since httplib uses the email header-parsing code. In Python 3 we are actually using the email package to parse the headers (which is sensible); in 2.7 it is a copy of code from the old mimelib with some tweaks. Fixing this in Python 3 is best done by making the error-recovery enhancement(s) I mentioned in the email package.

    Please note that changing this behavior has the potential to break working code. That is, just as we have here a server that is out of spec and sending invalid lines in the middle of headers, we may have a server that is out of spec by not sending the blank delimiter, which is currently being handled "correctly". Thus I don't think your simple fix is advisable; instead I think we should pursue the more complicated "look ahead" fix.

    Lukasa mannequin commented Jun 2, 2015

    This is one of those bugs that's actually super tricky to correctly fix.

    The correct path is to aim to conservatively accept as many headers as possible. Probably this means looking ahead at the next few lines and seeing whether they appear to roughly keep a header structure (\r\n line breaks and colon-separated values). However, I'm not entirely sure how to express that in code at this time.

    bitdancer (member) commented:

    I think there may be a way to accomplish this in a reasonably straightforward fashion in python3 given that feedparser has an 'unreadline' function. The python2 case is probably going to be a more complicated change. And I agree that multiple lines should be examined.

    There is also the question of what to do with the extra data. I think the correct approach is to treat a line that does not conform as a line that is missing the leading whitespace that would make it a continuation line. It looks like that would certainly be correct for the case in hand.
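    The look-ahead recovery described in the last two comments might be sketched roughly like this; the function and regex are invented for illustration and are not the actual email/http code:

```python
import re

# Rough shape of a header field line: token characters, then a colon
# (the character class is printable ASCII minus ':' and the space).
HEADER_RE = re.compile(r"^[!-9;-~]+:")

def recover_headers(lines):
    """Keep a malformed line as an obs-fold-style continuation of the
    previous field when the *next* line still looks like a header (or
    ends the section); otherwise it could be recorded as a defect."""
    fields = []
    for i, line in enumerate(lines):
        if not line.strip():                        # blank line ends headers
            break
        if HEADER_RE.match(line) or line[:1] in (" ", "\t"):
            fields.append(line)
        elif fields and (i + 1 == len(lines)
                         or HEADER_RE.match(lines[i + 1])
                         or not lines[i + 1].strip()):
            # Treat it as a continuation missing its leading whitespace.
            fields[-1] = fields[-1].rstrip("\r\n") + " " + line
        # else: genuinely unrecognizable; drop or record a defect
    return fields

hdrs = recover_headers([
    "Set-Cookie: mb-CookieP=; HttpOnly; \r\n",
    "Secure\r\n",                                   # lost continuation
    "X-Frame-Options: SAMEORIGIN\r\n",
    "\r\n",
])
# hdrs[0] now carries the "Secure" attribute instead of losing it.
```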

    mgdelmonte mannequin commented Jun 3, 2015

    I don't want to speak out of school and you guys certainly know what you're doing, but it seems a shame to go through these gyrations -- lookahead plus "unreading" lines -- only to preserve the ability to parse email headers, when HTTP really does follow a different spec. My suggestion would be to examine the header and decide, if it's HTTP, to just ignore nonconforming lines; and if it's email, then the problem is already solved (as email doesn't have encoding rules that would cause problems later).

    My fear would be that you'll eventually get that nonconforming line with leading whitespace, which will lead right back to the same error.

    bitdancer (member) commented:

    No, the point is to do "best practical" error recovery when faced with dirty data that may be dirty in various ways, and it doesn't really matter whether it is http headers or email headers. A line with leading whitespace is treated as part of the preceding header line now, and this is the way it should behave, since the older http standards adopted that behavior from rfc822. You will note that the standard referenced by Ian is explicit about that, in the obs-fold clause.

    That is, we are required by the standard to support that behavior, which is why I posit that the best recovery is to assume an invalid line followed by what look like headers is in fact an incorrectly folded obs-fold continuation line. That this will also conform to the email standard is a not-accidental consequence of how these standards evolved. (That is, email and http header handling are *not* "different" specs in the sense of being disjoint; they are derivatives of a common ancestor spec, and some effort is spent keeping them interoperable, to my understanding.)

    vadmium commented Jun 3, 2015

    Regarding the suggested fix for Python 2, make sure it does not prematurely end the parsing on empty folded lines (having only tabs and spaces in them). E.g. according to RFC 7230 this should be a single header field:

    b"Header: obsolete but\r\n"
    b" \r\n"
    b" still valid\r\n"
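    For reference, unfolding such an obs-fold sequence (RFC 7230 says recipients must replace each CRLF-plus-whitespace with one or more spaces) is a one-liner:

```python
import re

raw = ("Header: obsolete but\r\n"
       " \r\n"
       " still valid\r\n")

# Replace each obs-fold (CRLF followed by SP/HTAB) with a single space.
unfolded = re.sub(r"\r\n[ \t]+", " ", raw)
print(repr(unfolded))  # -> 'Header: obsolete but  still valid\r\n'
```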

    I suspect the RFC doesn’t say anything specifically about this case. In general the guidance seems to be things like:

    • User agents should be tolerant of errors received in the protocol
    • Proxies should fix up errors when forwarding messages upstream
    • Servers should often reject errors in requests with 400 Bad Request (presumably to avoid the possibility of a downstream proxy being tricked by the protocol error and not triggering some security filter)

    In the case of the bank web site, the last lines of the header are:

    X-Frame-Options: SAMEORIGIN\r\n
    Set-Cookie: mb-CookieP=; HttpOnly; \r\n
    Secure\r\n
    Set-Cookie: mb-CookieP=; HttpOnly; Secure\r\n
    \r\n

    It is obvious that this case could be treated as a folded (continuation) line. But in general I think it would be better to ignore the erroneous line, or to record it as a defect so that the server module or other user can check it.

    Looking at the Python 3 code, both the client and server call http.client._parse_headers(), which sensibly reads each line until it sees a blank line (Lib/http/client.py:197). But then after jumping through some more hoops we parse it again until we find a line that fails the regular expression at Lib/email/feedparser.py:37. The remaining lines become the “payload” of the HTTP header:

    >>> with urlopen("http://www.merrickbank.com/") as response:
    ...     response.info().get_payload()
    ... 
    'Secure\r\nSet-Cookie: mb-CookieP=; HttpOnly; Secure\r\n\r\n'

    What might be nice is a way to reuse the email header field parsing code, without worrying about the “From” line stuff, or the payload stuff.

    bitdancer (member) commented:

    Since the email package has the correct logic for handling the blank continuation line case (even in Python2) (because, again, that derives from the original email standard), it might be reasonable to use feedparser's headersonly mode. If necessary we can introduce a (private) variation that actually stops parsing the stream at the end of the headers. (This would be a non-public API on feedparser for 2.7 and 3.4/5, and a public one in 3.6.) Since feedparser was itself a rewrite of the mimetools code httplib is using and theoretically backward compatible with it in behavior, this should not in theory introduce any behavior changes (famous last words?).

    The problem here is the good chance that someone is depending on the internal implementation of HTTPMessage. So, I'm not at all sure about this suggestion.

    And yes, the email package in python3 records header defects. However, I think it is better to treat the no-leading-blank non-header as an incorrectly folded continuation line rather than discarding the data. The data is in the headers section for *some* reason, and IMO that is the only reasonable guess as to why.

    Lukasa mannequin commented Jun 3, 2015

    It is obvious that this case could be treated as a folded (continuation) line. But in general I think it would be better to ignore the erroneous line, or to record it as a defect so that the server module or other user can check it.

    Just to clarify, in an instance very similar to this one this would be *terrible* advice. The token that would be lost here is the 'Secure' field on the cookie, which is an extremely important token to have: if we don't correctly parse it, we run the risk of sending the cookie on plaintext connections.

    Discarding data is the problem, and while discarding *less* data is an improvement, it would be good if we could resolve this problem in such a way that we'd have correctly parsed this header.

    Generally speaking, if we treat these as continuation lines I think we have the best chance of making a useful header out of this mess.

    Lukasa mannequin commented Jun 3, 2015

    While we're here and I'm recommending to drop as little data as possible: we need to be really careful about not exposing ourselves to any kind of data smuggling attack here.

    It's really important that we don't let attackers construct bodies of requests or responses that will cause us to misinterpret header blocks. It's therefore going to be really tricky to balance those concerns.

    mgdelmonte mannequin commented Jun 3, 2015

    Given that obs-fold is technically valid, then can I recommend reading the entire header first (reading to the first blank line) and then tokenizing the individual headers using a regular expression rather than line by line? That would solve the problem more elegantly and easily than attempting to read lines on the fly and then "unreading" the nonconforming lines.

    Here's my recommendation:

        def readheaders(self):
            self.dict = {}
            self.unixfrom = ''
            self.headers = hlist = []
            self.status = ''
            # read entire header (read until first blank line)
            while True:
                line = self.fp.readline(_MAXLINE+1)
                if not line:
                    self.status = 'EOF in headers'
                    break
                if len(line) > _MAXLINE:
                    raise LineTooLong("header line")
                hlist.append(line)
                if line in ('\n', '\r\n'):
                    break
                if len(hlist) > _MAXHEADERS:
                    raise HTTPException("got more than %d headers" % _MAXHEADERS)
            # reproduce and parse as string
            header = "\n".join(hlist)
            self.headers = re.findall(r"[^ \n][^\n]+\n(?: +[^\n]+\n)*", header)
            firstline = True
            for line in self.headers:
                if firstline and line.startswith('From '):
                    self.unixfrom = self.unixfrom + line
                    continue
                firstline = False
                if ':' in line:
                    k,v = line.split(':',1)
                    self.addheader(k, re.sub("\n +", " ", v.strip()))
                else:
                    self.status = 'Non-header line where header expected' if self.dict else 'No headers'

    I think this handles everything you're trying to do. I don't understand the unixfrom bit, but I think I have it right.

    As for Cory's concern re: smuggling, _MAXLINE and _MAXHEADERS should help with that. The regexp guarantees that every line plus continuation appears as a single header.

    I use re.sub("\n +", " ", v.strip()) to clean the value and remove the continuation.

    mgdelmonte mannequin commented Jun 3, 2015

    ... or perhaps

    if ':' in line and line[0] != ':':

    to avoid the colon-as-first-char bug that plagued this library earlier, though the only ill-effect of leaving it alone would be a header with a blank key; not the end of the world.

    vadmium commented Nov 29, 2015

    Since the Python 2 and Python 3 branches are different, two different patches would be needed here. Perhaps they could share common test cases though.

    Michael: I presume your proposal is for Python 2. I don’t understand the re.findall() expression; is there a clearer way to do whatever it is trying to do (or failing that, explain it in a comment)? It looks like you are trying to skip over spaces at the start of the first header field name. Also, it seems to drop support for lines folded with tabs rather than spaces.

    David: The headers-only mode wouldn’t make much difference because it only affects parsing the “payload”. As far as the email package is concerned, the payload should always be empty when used by HTTP’s parse_headers().

    The simplest fix (at least for the Python 3 code) would be to check each line against email.feedparser.headerRE before adding it to the list of lines in the HTTP package. This would prevent the email parser from bailing from the header section early, which is the main problem.

    But that does seem like a bad hack and wouldn’t treat the offending line as a folded line, which Cory and David want. So I guess we need to make the email parser more flexible instead. Maybe a private FeedParser._parse_header_lines() method or something, that replaces feed() and close(), and defers most of the processing directly to _parse_headers().
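    The "simplest fix" mentioned above could be sketched as a pre-filter; `read_headers` is a hypothetical stand-in for the reading loop in http.client, and note that (as conceded above) it discards the offending line rather than folding it:

```python
import io
from email import feedparser  # headerRE is a private detail of the email package

def read_headers(fp, max_line=65536, max_headers=100):
    """Collect raw header lines, skipping any line that the email
    parser's headerRE would reject, so that parser can no longer bail
    out of the header section early."""
    lines = []
    while True:
        line = fp.readline(max_line + 1)
        if line in (b"\r\n", b"\n", b""):
            break
        if feedparser.headerRE.match(line.decode("iso-8859-1")):
            lines.append(line)
        # else: malformed line; silently dropped (the "bad hack" part)
        if len(lines) > max_headers:
            raise ValueError("too many header lines")
    return lines

fp = io.BytesIO(b"Set-Cookie: a=b; \r\n"
                b"Secure\r\n"
                b"X-Frame-Options: SAMEORIGIN\r\n"
                b"\r\n")
print(read_headers(fp))  # the bare "Secure" line is gone
```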

    vadmium commented Jun 13, 2016

    See also bpo-26686; the same problem, but with parsing RFC5322 header fields, rather than HTTP.

    vadmium commented Jun 14, 2016

    I made a patch to fix all header section parsing by default in the email module (see bpo-26686). I’m not 100% sure if it is safe in general, but if it is, it would fix this bug.

    vadmium commented Aug 12, 2016

    In order to avoid messing too much with the intricacies of the existing email parsing, here is a patch for Python 3 that limits the behaviour changes to the HTTP module. It should fix the bad handling of broken header lines. As a side effect, it should also fix bpo-22233, since it bypasses the offending splitlines() call. I incorporated the test cases from my previous patches for bpo-26686 and bpo-22233.

    I tried to maintain a minimal compatibility with the previous special behaviour for message/* and multipart/* message types, although I didn’t bother trying to emulate e.g. StartBoundaryNotFoundDefect (http.client.parse_headers() doesn’t pass any body to the email parser, so it won’t see any start boundaries.)

    vadmium commented Sep 9, 2016

    Updated patch for Python 3 now that bpo-22233 has been fixed.

    bitdancer (member) commented:

    I think you can greatly simplify this code by using BytesFeedParser and feeding the input to it line by line. You don't need to gather the headers first, since you control how much you read and how much you send to the parser.

    The change to email is then the handling of lines without colons. The simplest fix is to add a policy control for whether a line-that-doesn't-look-like-a-header is considered part of the preceding header or treated as part of the body (or, to put it another way, do we really only believe the body starts after a blank line or not). You can make that a setting on Policy, and derive the policy used here in http from Compat32, so as to otherwise minimize the behavior changes.

    If we can't get this done by the beta cutoff on Monday, we can make the new policy setting start with an '_' in 3.6, since the point here is the bug fix, not the feature; and then make it public in 3.7 (since I think it is useful for more than just http).
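    A minimal sketch of such a policy knob; the class and flag names here are hypothetical, and deriving from Compat32 is what keeps every other behaviour unchanged:

```python
from email.policy import Compat32

class GreedyHeaderPolicy(Compat32):
    """Hypothetical policy variant: a flag the feed parser could consult
    to decide whether a non-header-like line seen before the blank
    separator belongs to the preceding header rather than to the body."""
    _greedy_header_parsing = True

# http.client would hand an instance of this to the email parser it uses.
policy = GreedyHeaderPolicy()
```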

    bitdancer (member) commented:

    I'm not sure it is a good idea to backport this to 2.7, but if you want to, the same fix can be made in a more hackish way on 2.7 by putting a private variable on FeedParser to control the header parsing behavior.

    bitdancer (member) commented:

    Hmm. Or maybe the latin-1 decode and FeedParser is better, since with BytesFeedParser and non-ascii you get Header objects, which you don't want. Either way, there's no TextIOWrapper involved.

    vadmium commented Sep 9, 2016

    I guess we could add this secret policy flag that the email parser checks. The solution should still be applied as a bug fix to 3.5 as well as 3.6+. I would have to make the flag “very” unique, to reduce the chance of it breaking user code. I.e. adding policy._strict_end_of_headers might interfere with somebody’s user-defined policy class, but maybe policy.__strict_end_of_headers__ is safe, since user code is not supposed to invent such attribute names. Or do you think the “_strict_end_of_headers” name is fair game, since the policy stuff is/was provisional?

    It would be good to fix 2.7. This bug, and the duplicate report, were both originally reported for 2.7. I haven’t looked at the situation there in detail, but it looks completely different. HTTPMessage is based on the “mimetools” package, and there is a lot more parsing code already in “httplib”.

    bitdancer (member) commented:

    Oh, yes, I forgot 2.7 was using the older code.

    Sure, that flag name sounds fine. I'm not worrying about policy flag name collisions, but perhaps I should be. How about _py_strict_end_of_headers? I don't really care what it is named for the bug fix, I'll think about the exposed name for 3.7.

    vadmium commented Sep 10, 2016

    Here is a fix using a policy flag. I called it “_py_body_detached”.

    vadmium commented Sep 10, 2016

    Patch for Python 2

    python-dev mannequin commented Sep 16, 2016

    New changeset c38e10ad7c89 by Martin Panter in branch '2.7':
    Issue bpo-24363: Continue parsing HTTP header in spite of invalid lines
    https://hg.python.org/cpython/rev/c38e10ad7c89

    vadmium commented Sep 16, 2016

    I pushed my Py 2 patch, since it is simpler and does not interfere with other modules. But it would still be good to get feedback on policy-flag.patch for Python 3.

    bitdancer (member) commented:

    I will try to review this in the not too distant future. You can ping me if I don't get to it by next Saturday.

    I think I'll probably prefer to call the flag something like _greedy_header_parsing, to reflect a change from assuming we've got body on a non-header-like line to assuming we've still got header if we haven't seen a blank line yet. What we call the private version of the flag doesn't matter all that much, though.

    vadmium commented Jan 24, 2017

    Just a minor update with an extra get_payload() test I missed before

    @vadmium vadmium added the 3.7 (EOL) end of life label Jan 24, 2017
    gboudreau mannequin commented Feb 7, 2017

    Any chance this could get reviewed and merged soon? I got hit by a similar issue (see bpo-29445) where the server, which I don't control, sends an invalid HTTP header that prevents all the headers following it from being parsed.
    The latest attached patch fixed the issue for me.
    Thanks.

    bitdancer (member) commented:

    Yeah, I'm going to try to get to this this weekend.

    maxking commented Sep 14, 2019

    Martin: Can you please create a PR for the added patch? If you are busy, I can do that for you, just wanted to ask before I do :)

    I am going to remove the "easy" label from this issue, which IMO it clearly isn't, given the 4 years of history to catch up on and a few other related bugs.

    Since some time has passed since the patch was posted, I am going to target this at 3.8 and 3.7, since the other versions are in security-fix-only mode.

    @maxking maxking added 3.8 only security fixes and removed easy labels Sep 14, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022