Title: http.client aborts header parsing upon encountering non-ASCII header names
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, maxking, r.david.murray, tburke
Priority: normal Keywords: patch

Created on 2019-05-29 23:32 by tburke, last changed 2019-06-03 22:20 by tburke.

Pull Requests
URL Status Linked Edit
PR 13788 open tburke, 2019-06-03 22:20
Messages (1)
msg343942 - Author: Tim Burke (tburke) Date: 2019-05-29 23:32
First, spin up a fairly trivial http server:

    import wsgiref.simple_server
    def app(environ, start_response):
        start_response('200 OK', [
            ('Some-Canonical', 'headers'),
            ('sOme-CRAzY', 'hEaDERs'),
            ('Utf-8-Values', '\xe2\x9c\x94'),
            ('s\xc3\xb6me-UT\xc6\x92-8', 'in the header name'),
            ('some-other', 'random headers'),
        ])
        return [b'Hello, world!\n']
    if __name__ == '__main__':
        httpd = wsgiref.simple_server.make_server('', 8000, app)
        while True:
            # Serve one request per iteration until interrupted.
            httpd.handle_request()

Note that this code works equally well on py2 or py3; the interesting bytes on the wire are the same on either.
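
To make the "same bytes on the wire" point concrete, here is a small sketch. It assumes the py3 WSGI convention from PEP 3333, under which header strings are "native strings" that the server encodes as ISO-8859-1 when writing the response. The helper names `wire_bytes` and `intended_text` are hypothetical and exist only for this illustration.

```python
# Header strings exactly as written in the server example above.
NAME = 's\xc3\xb6me-UT\xc6\x92-8'
VALUE = '\xe2\x9c\x94'


def wire_bytes(native: str) -> bytes:
    """Encode a py3 WSGI native string as ISO-8859-1 (assumed per PEP 3333)."""
    return native.encode('iso-8859-1')


def intended_text(native: str) -> str:
    """Reinterpret the wire bytes as UTF-8, the way curl's terminal output does."""
    return wire_bytes(native).decode('utf-8')


if __name__ == '__main__':
    print(wire_bytes(NAME))      # raw bytes of the header name
    print(intended_text(NAME))   # header name read as UTF-8
    print(intended_text(VALUE))  # header value read as UTF-8
```

On py2 the same literals are already byte strings, so no encoding step is needed there; either way the client receives UTF-8 bytes in both the header name and the header value.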

Verify the expected response using an independent tool such as curl:

    $ curl -v http://localhost:8000
    *   Trying ::1...
    * TCP_NODELAY set
    * connect to ::1 port 8000 failed: Connection refused
    *   Trying
    * TCP_NODELAY set
    * Connected to localhost ( port 8000 (#0)
    > GET / HTTP/1.1
    > Host: localhost:8000
    > User-Agent: curl/7.64.0
    > Accept: */*
    * HTTP 1.0, assume close after body
    < HTTP/1.0 200 OK
    < Date: Wed, 29 May 2019 23:02:37 GMT
    < Server: WSGIServer/0.2 CPython/3.7.3
    < Some-Canonical: headers
    < sOme-CRAzY: hEaDERs
    < Utf-8-Values: ✔
    < söme-UTƒ-8: in the header name
    < some-other: random headers
    < Content-Length: 14
    Hello, world!
    * Closing connection 0

Check that py2 includes all the same headers:

    $ python2 -c 'import pprint, urllib; resp = urllib.urlopen("http://localhost:8000"); pprint.pprint((dict(,'
    ({'content-length': '14',
      'date': 'Wed, 29 May 2019 23:03:02 GMT',
      'server': 'WSGIServer/0.2 CPython/3.7.3',
      'some-canonical': 'headers',
      'some-crazy': 'hEaDERs',
      'some-other': 'random headers',
      's\xc3\xb6me-ut\xc6\x92-8': 'in the header name',
      'utf-8-values': '\xe2\x9c\x94'},
     'Hello, world!\n')

But py3 *does not*:

    $ python3 -c 'import pprint, urllib.request; resp = urllib.request.urlopen("http://localhost:8000"); pprint.pprint((dict(,'
    ({'Date': 'Wed, 29 May 2019 23:04:09 GMT',
      'Server': 'WSGIServer/0.2 CPython/3.7.3',
      'Some-Canonical': 'headers',
      'Utf-8-Values': 'â\x9c\x94',
      'sOme-CRAzY': 'hEaDERs'},
     b'Hello, world!\n')

Instead, it is missing the first header that has a non-ASCII name as well as all subsequent headers (even if they are all-ASCII). Interestingly, the response body is intact.

This is eventually traced back to email.feedparser's expectation that all headers conform to rfc822 and its assumption that anything that *doesn't* conform must be part of the body:

However, http.client has *already* determined the boundary between headers and body in parse_headers, and sent everything that it thinks is headers to the parser:
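
The truncation can be reproduced without a server. The following is a minimal sketch, assuming (as in the 3.7 sources) that http.client decodes the collected header lines as ISO-8859-1 and hands the resulting string to `email.parser.Parser`. The function name `parse_header_block` is hypothetical, and the default email policy is used here rather than http.client's own message class.

```python
import email.parser

# A header block similar to the one the test server sends, as raw bytes.
RAW_HEADERS = (
    b"Some-Canonical: headers\r\n"
    b"s\xc3\xb6me-UT\xc6\x92-8: in the header name\r\n"
    b"some-other: random headers\r\n"
    b"\r\n"
)


def parse_header_block(raw: bytes):
    """Decode as ISO-8859-1 and parse with the email package.

    Returns the parsed headers, any recorded defects, and whatever
    the parser decided was the message body.
    """
    text = raw.decode("iso-8859-1")
    msg = email.parser.Parser().parsestr(text)
    return dict(msg.items()), msg.defects, msg.get_payload()


if __name__ == "__main__":
    headers, defects, payload = parse_header_block(RAW_HEADERS)
    print(headers)        # headers seen before the non-ASCII name
    print(defects)        # defects the parser recorded, if any
    print(repr(payload))  # remaining header lines, treated as body
```

On the affected versions, the non-ASCII header line and everything after it should show up in the payload rather than in the header dict, which matches the missing headers in the py3 output above.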

Date                 User         Action  Args
2019-06-03 22:20:59  tburke       set     keywords: + patch
                                          stage: test needed -> patch review
                                          pull_requests: + pull_request13672
2019-05-30 07:09:27  SilentGhost  set     versions: - Python 3.5, Python 3.6
                                          nosy: + barry, r.david.murray, maxking
                                          components: + Library (Lib)
                                          type: behavior
                                          stage: test needed
2019-05-29 23:32:09  tburke       create