Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http.client splits headers on non-\r\n characters #66429

Closed
scharron mannequin opened this issue Aug 20, 2014 · 13 comments
Closed

http.client splits headers on non-\r\n characters #66429

scharron mannequin opened this issue Aug 20, 2014 · 13 comments
Labels
stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error

Comments

@scharron
Copy link
Mannequin

scharron mannequin commented Aug 20, 2014

BPO 22233
Nosy @warsaw, @gpshead, @ezio-melotti, @dhellmann, @bitdancer, @vadmium, @serhiy-storchaka, @demianbrecht
Files
  • crlf-headers.patch
  • crlf-headers2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-09-07.21:49:00.246>
    created_at = <Date 2014-08-20.10:02:33.816>
    labels = ['type-bug', 'library', 'expert-email']
    title = 'http.client splits headers on non-\\r\\n characters'
    updated_at = <Date 2016-09-07.22:44:26.851>
    user = 'https://bugs.python.org/scharron'

    bugs.python.org fields:

    activity = <Date 2016-09-07.22:44:26.851>
    actor = 'martin.panter'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-09-07.21:49:00.246>
    closer = 'r.david.murray'
    components = ['Library (Lib)', 'email']
    creation = <Date 2014-08-20.10:02:33.816>
    creator = 'scharron'
    dependencies = []
    files = ['41177', '44442']
    hgrepos = []
    issue_num = 22233
    keywords = ['patch']
    message_count = 13.0
    messages = ['225562', '225565', '246569', '255440', '255460', '255518', '273983', '273988', '274013', '274840', '274899', '274900', '274903']
    nosy_count = 11.0
    nosy_names = ['barry', 'gregory.p.smith', 'ezio.melotti', 'doughellmann', 'r.david.murray', 'python-dev', 'martin.panter', 'serhiy.storchaka', 'demian.brecht', 'scharron', 'Clay Gerrard']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue22233'
    versions = ['Python 3.5', 'Python 3.6']

    @scharron
    Copy link
    Mannequin Author

    scharron mannequin commented Aug 20, 2014

    In some cases, the headers from http.client (that uses email.feedparser) splits headers at wrong separators. The bug is from the use of str.splitlines (in email.feedparser) that splits on other characters than \r\n as it should. (See bug http://bugs.python.org/issue22232)

    To reproduce the bug :

    import http.client
    c = http.client.HTTPSConnection("graph.facebook.com")
    c.request("GET", "/%C4%85", None, {"test": "\x85"})
    r = c.getresponse()
    print(r.headers)
    print(r.headers.keys())
    print(r.headers.get("WWW-Authenticate"))

    As you can see, the WWW-Authenticate is wrongly parsed (it misses its final "), and therefore the rest of the headers are ignored.

    @scharron scharron mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Aug 20, 2014
    @scharron
    Copy link
    Mannequin Author

    scharron mannequin commented Aug 20, 2014

    A consequence of this bug is that r.read() blocks until a timeout occurs since the content-length header is not interpreted (I think this is related to the HTTPResponse.__init__ comment)

    @gpshead
    Copy link
    Member

    gpshead commented Jul 10, 2015

    The obvious fix seems to be to not use splitlines but explicitly split on the allowed characters for ASCII based protocols and formats that only want \r and \n to be considered.

    I don't think we can rightfully change the unicode splitlines behavior.

    @gpshead gpshead changed the title http.client splits headers on none-\r\n characters http.client splits headers on non-\r\n characters Jul 10, 2015
    @vadmium
    Copy link
    Member

    vadmium commented Nov 26, 2015

    For the record, this is a simplified version of the original scenario, showing the low-level HTTP protocol:

    >>> request = (
    ...     b"GET /%C4%85 HTTP/1.1\r\n"
    ...     b"Host: graph.facebook.com\r\n"
    ...     b"\r\n"
    ... )
    >>> s = create_connection(("graph.facebook.com", HTTPS_PORT))
    >>> with ssl.wrap_socket(s) as s:
    ...     s.sendall(request)
    ...     response = s.recv(3000)
    ... 
    50
    >>> pprint(response.splitlines(keepends=True))
    [b'HTTP/1.1 404 Not Found\r\n',
     b'WWW-Authenticate: OAuth "Facebook Platform" "not_found" "(#803) Some of the '
     b'aliases you requested do not exist: \xc4\x85"\r\n',
     b'Access-Control-Allow-Origin: *\r\n',
     b'Content-Type: text/javascript; charset=UTF-8\r\n',
     b'X-FB-Trace-ID: H9yxnVcQFuA\r\n',
     b'X-FB-Rev: 2063232\r\n',
     b'Pragma: no-cache\r\n',
     b'Cache-Control: no-store\r\n',
     b'Facebook-API-Version: v2.0\r\n',
     b'Expires: Sat, 01 Jan 2000 00:00:00 GMT\r\n',
     b'X-FB-Debug: 07ouxMl1Z439Ke/YzHSjXx3o9PcpGeZBFS7yrGwTzaaudrZWe5Ef8Z96oSo2dINp'
     b'3GR4q78+1oHDX2pUF2ky1A==\r\n',
     b'Date: Thu, 26 Nov 2015 23:03:47 GMT\r\n',
     b'Connection: keep-alive\r\n',
     b'Content-Length: 147\r\n',
     b'\r\n',
     b'{"error":{"message":"(#803) Some of the aliases you requested do not exist: '
     b'\\u0105","type":"OAuthException","code":803,"fbtrace_id":"H9yxnVcQFuA"}}']

    In my mind, the simplest way forward would be to change the “email” module to only parse lines using the “universal newlines” algorithm. The /Lib/email/feedparser.py module should use StringIO(s, newline="").readlines() rather than s.splitlines(keepends=True). That would mean all email parsing behaviour would change; for instance, given the following message:

    >>> m = email.message_from_string(
    ...     "WWW-Authenticate: abc\x85<body or header?>\r\n"
    ...     "\r\n"
    ... )

    instead of the current behaviour:

    >>> m.items()
    [('WWW-Authenticate', 'abc\x85')]
    >>> m.get_payload()
    '<body or header?>\r\n\r\n'

    it would change to:

    >>> m.items()
    [('WWW-Authenticate', 'abc\x85<body or header?>')]
    >>> m.get_payload()
    ''

    @bitdancer
    Copy link
    Member

    I agree. Can you update the email issue with this suggestion and/or a patch?

    The problem with this, of course is backward compatibility, but since it is more correct per the RFCs, our past policy has been to fix it anyway.

    @vadmium
    Copy link
    Member

    vadmium commented Nov 28, 2015

    David: what is the email issue you mentioned? In the mean time, I am uploading a patch to this issue.

    It seems using StringIO is a bit slower than str.splitlines(). I found a way to optimize building long lines, which compensated a lot of the loss, but this optimization would apply even without using StringIO. My patch makes test.test_email 0.3% slower (the optimization alone would make it 4.4% faster), and test_email.TestFeedParsers.test_long_lines() is 3% slower (optimization 12% faster).

    I also tried two other alternatives to str.splitlines(), but they were both slower than the StringIO technique:

    • _partial is a list of UTF-8 bytes; join and use bytes.splitlines()
    • _partial is a UTF-8 bytearray; use bytearray.splitlines()

    @ClayGerrard
    Copy link
    Mannequin

    ClayGerrard mannequin commented Aug 31, 2016

    BUMP. ;)

    This issue was recently raised as one blocker to OpenStack Object Storage (Swift) finishing our port to python3 (we're hoping to finish adding support >=3.5 by Spring '17 - /me crosses fingers).

    I wonder if someone might confirm or deny the attached patch is likely to be included in the 3.6 timeframe (circa 12/16?) and/or back-ported to the 3.5 series?

    FWIW, I would echo other's sentiment that I would much prefer the implementation to be correct even if there was some worry we might have to choose between further optimization and getting a fix ASAP :D

    Warm Regards,
    -Clay

    @vadmium
    Copy link
    Member

    vadmium commented Aug 31, 2016

    If someone reviews my patch and thinks it is fine, I might commit it. Maybe I can just re-review it myself, now that I have forgotten all the details :)

    If messing with the email package is a problem (performance, or compatibility), another option is to keep the changes to the HTTP module (which I would be more confident in changing on my own). I have another patch for review at bpo-24363 which apparently also fixes this splitlines() bug.

    @bitdancer
    Copy link
    Member

    I'm hoping to take a look at all of these at the core sprint next week.

    @bitdancer
    Copy link
    Member

    This looks good to me. However, although it is by no means obvious, the tests in test_parser are supposed to be for the new policies. When I changed the test to test them another place that needed to fixed was revealed. I've updated the patch accordingly.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 7, 2016

    New changeset 69900c5992c5 by R David Murray in branch '3.5':
    bpo-22233: Only split headers on \r and/or \n, per email RFCs.
    https://hg.python.org/cpython/rev/69900c5992c5

    New changeset 4d2369b901be by R David Murray in branch 'default':
    Merge: bpo-22233: Only split headers on \r and/or \n, per email RFCs.
    https://hg.python.org/cpython/rev/4d2369b901be

    @bitdancer
    Copy link
    Member

    I want to stack another patch on top of this, so I committed it. If you see anything I screwed up, Martin, please let me know.

    @vadmium
    Copy link
    Member

    vadmium commented Sep 7, 2016

    Your modifications look sensible David; thanks for handling this.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants