First, spin up a fairly trivial HTTP server:
import wsgiref.simple_server

def app(environ, start_response):
    start_response('200 OK', [
        ('Some-Canonical', 'headers'),
        ('sOme-CRAzY', 'hEaDERs'),
        ('Utf-8-Values', '\xe2\x9c\x94'),
        ('s\xc3\xb6me-UT\xc6\x92-8', 'in the header name'),
        ('some-other', 'random headers'),
    ])
    return [b'Hello, world!\n']

if __name__ == '__main__':
    httpd = wsgiref.simple_server.make_server('', 8000, app)
    while True:
        httpd.handle_request()
Note that this code works equally well on py2 or py3; the interesting bytes on the wire are the same on either.
Verify the expected response using an independent tool such as curl:
$ curl -v http://localhost:8000
* Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 8000 failed: Connection refused
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.64.0
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Date: Wed, 29 May 2019 23:02:37 GMT
< Server: WSGIServer/0.2 CPython/3.7.3
< Some-Canonical: headers
< sOme-CRAzY: hEaDERs
< Utf-8-Values: ✔
< söme-UTƒ-8: in the header name
< some-other: random headers
< Content-Length: 14
<
Hello, world!
* Closing connection 0
Check that py2 includes all the same headers:
$ python2 -c 'import pprint, urllib; resp = urllib.urlopen("http://localhost:8000"); pprint.pprint((dict(resp.info().items()), resp.read()))'
({'content-length': '14',
  'date': 'Wed, 29 May 2019 23:03:02 GMT',
  'server': 'WSGIServer/0.2 CPython/3.7.3',
  'some-canonical': 'headers',
  'some-crazy': 'hEaDERs',
  'some-other': 'random headers',
  's\xc3\xb6me-ut\xc6\x92-8': 'in the header name',
  'utf-8-values': '\xe2\x9c\x94'},
 'Hello, world!\n')
But py3 *does not*:
$ python3 -c 'import pprint, urllib.request; resp = urllib.request.urlopen("http://localhost:8000"); pprint.pprint((dict(resp.info().items()), resp.read()))'
({'Date': 'Wed, 29 May 2019 23:04:09 GMT',
  'Server': 'WSGIServer/0.2 CPython/3.7.3',
  'Some-Canonical': 'headers',
  'Utf-8-Values': 'â\x9c\x94',
  'sOme-CRAzY': 'hEaDERs'},
 b'Hello, world!\n')
Instead, the py3 result is missing the first header that has a non-ASCII name as well as every header after it (even the all-ASCII ones, including Content-Length). Interestingly, the response body is intact.
This can be traced back to email.feedparser, which expects every header line to conform to rfc822 and assumes that anything that *doesn't* conform must be the start of the body: https://github.com/python/cpython/blob/v3.7.3/Lib/email/feedparser.py#L228-L236
However, http.client has *already* determined the boundary between headers and body in parse_headers, and sent everything that it thinks is headers to the parser: https://github.com/python/cpython/blob/v3.7.3/Lib/http/client.py#L193-L214
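The parsing step can be reproduced without a server or socket. The following is a minimal sketch, assuming the header block is handed to the email parser the way the linked parse_headers code does (joined, decoded as ISO-8859-1, then passed to email.parser.Parser). The variable names raw, hstring and msg are only for illustration, and the header set is a shortened version of the one above.

```python
import email.errors
import email.parser
import http.client

# Header block as it would look after http.client has already cut it at the
# blank line. The names/values mirror the test server above (shortened).
raw = (
    b"Some-Canonical: headers\r\n"
    b"s\xc3\xb6me-UT\xc6\x92-8: in the header name\r\n"
    b"some-other: random headers\r\n"
    b"\r\n"
)

# Assumption: same decoding as the linked parse_headers code (ISO-8859-1).
hstring = raw.decode("iso-8859-1")

msg = email.parser.Parser(_class=http.client.HTTPMessage).parsestr(hstring)

# Headers the parser recognised; parsing stops at the non-ASCII name.
print(msg.keys())

# Everything from the non-ASCII header onward ends up in the "body".
print(repr(msg.get_payload()))

# The parser records a defect rather than raising.
print(msg.defects)
```

If this behaves as expected, only Some-Canonical is reported as a header, the remaining header lines show up in the payload, and msg.defects contains a MissingHeaderBodySeparatorDefect. That would also explain why the real response body is unaffected: http.client reads it from the socket separately, so the bogus "body" produced by the email parser is simply never looked at.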