Classification
Title: http.client aborts header parsing upon encountering non-ASCII header names
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7
Process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, maxking, r.david.murray, tburke
Priority: normal Keywords: patch

Created on 2019-05-29 23:32 by tburke, last changed 2019-12-23 20:35 by tburke.

Pull Requests
PR 13788 (open), tburke, 2019-06-03 22:20
Messages (2)
msg343942 - (view) Author: Tim Burke (tburke) * Date: 2019-05-29 23:32
First, spin up a fairly trivial http server:

    import wsgiref.simple_server
    
    def app(environ, start_response):
        start_response('200 OK', [
            ('Some-Canonical', 'headers'),
            ('sOme-CRAzY', 'hEaDERs'),
            ('Utf-8-Values', '\xe2\x9c\x94'),
            ('s\xc3\xb6me-UT\xc6\x92-8', 'in the header name'),
            ('some-other', 'random headers'),
        ])
        return [b'Hello, world!\n']
    
    if __name__ == '__main__':
        httpd = wsgiref.simple_server.make_server('', 8000, app)
        while True:
            httpd.handle_request()

Note that this code runs equally well on py2 and py3; the interesting bytes on the wire are the same either way.

Verify the expected response using an independent tool such as curl:

    $ curl -v http://localhost:8000
    *   Trying ::1...
    * TCP_NODELAY set
    * connect to ::1 port 8000 failed: Connection refused
    *   Trying 127.0.0.1...
    * TCP_NODELAY set
    * Connected to localhost (127.0.0.1) port 8000 (#0)
    > GET / HTTP/1.1
    > Host: localhost:8000
    > User-Agent: curl/7.64.0
    > Accept: */*
    > 
    * HTTP 1.0, assume close after body
    < HTTP/1.0 200 OK
    < Date: Wed, 29 May 2019 23:02:37 GMT
    < Server: WSGIServer/0.2 CPython/3.7.3
    < Some-Canonical: headers
    < sOme-CRAzY: hEaDERs
    < Utf-8-Values: ✔
    < söme-UTƒ-8: in the header name
    < some-other: random headers
    < Content-Length: 14
    < 
    Hello, world!
    * Closing connection 0

Check that py2 includes all the same headers:

    $ python2 -c 'import pprint, urllib; resp = urllib.urlopen("http://localhost:8000"); pprint.pprint((dict(resp.info().items()), resp.read()))'
    ({'content-length': '14',
      'date': 'Wed, 29 May 2019 23:03:02 GMT',
      'server': 'WSGIServer/0.2 CPython/3.7.3',
      'some-canonical': 'headers',
      'some-crazy': 'hEaDERs',
      'some-other': 'random headers',
      's\xc3\xb6me-ut\xc6\x92-8': 'in the header name',
      'utf-8-values': '\xe2\x9c\x94'},
     'Hello, world!\n')

But py3 *does not*:

    $ python3 -c 'import pprint, urllib.request; resp = urllib.request.urlopen("http://localhost:8000"); pprint.pprint((dict(resp.info().items()), resp.read()))'
    ({'Date': 'Wed, 29 May 2019 23:04:09 GMT',
      'Server': 'WSGIServer/0.2 CPython/3.7.3',
      'Some-Canonical': 'headers',
      'Utf-8-Values': 'â\x9c\x94',
      'sOme-CRAzY': 'hEaDERs'},
     b'Hello, world!\n')

Instead, the py3 response is missing the first header with a non-ASCII name, as well as all subsequent headers (even the all-ASCII ones). Interestingly, the response body is intact.

This is eventually traced back to email.feedparser's expectation that all headers conform to RFC 822, and its assumption that anything that *doesn't* conform must be part of the body: https://github.com/python/cpython/blob/v3.7.3/Lib/email/feedparser.py#L228-L236
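The feedparser behavior can be demonstrated directly (a minimal sketch with made-up header names): when a line fails the header regex, the parser records a defect and treats that line and everything after it as the body.

    import email.parser

    raw = (
        'Some-Canonical: headers\r\n'
        's\u00f6me-UT\u0192-8: in the header name\r\n'
        'some-other: random headers\r\n'
        '\r\n'
    )
    msg = email.parser.Parser().parsestr(raw)
    # Only the headers before the non-ASCII name survive...
    print(msg.items())
    # ...and the offending line plus all later headers land in the payload.
    print(repr(msg.get_payload()))
    print(msg.defects)  # includes a MissingHeaderBodySeparatorDefect
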

However, http.client has *already* determined the boundary between headers and body in parse_headers, and sent everything that it thinks is headers to the parser: https://github.com/python/cpython/blob/v3.7.3/Lib/http/client.py#L193-L214
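You can see the mismatch by calling parse_headers directly on a file object (a sketch using the same made-up header names as above): the header/body boundary is consumed correctly, yet the headers after the non-ASCII name silently vanish.

    import io
    import http.client

    fp = io.BytesIO(
        b'Some-Canonical: headers\r\n'
        b's\xc3\xb6me-UT\xc6\x92-8: in the header name\r\n'
        b'some-other: random headers\r\n'
        b'\r\n'
        b'Hello, world!\n'
    )
    msg = http.client.parse_headers(fp)
    print(msg.items())   # only the headers before the non-ASCII name
    body = fp.read()     # the boundary was still found correctly
    print(body)
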
msg358833 - (view) Author: Tim Burke (tburke) * Date: 2019-12-23 20:35
Note that because http.server uses http.client to parse headers [0], this can pose a request-smuggling vector depending on how you've designed your system. For example, you might have a storage system with a user-facing HTTP server that is in charge of

* authenticating and authorizing users,
* determining where data should be stored, and
* proxying the user request to the backend

and a separate (unauthenticated) HTTP server for actually storing that data. If the proxy and backend are running different versions of CPython (say, because you're trying to upgrade an existing py2 cluster to run on py3), they may disagree about where the request begins and ends -- potentially causing the backend to process multiple requests, only the first of which was authorized.

See, for example, https://bugs.launchpad.net/swift/+bug/1840507
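To make the disagreement concrete, here is a minimal sketch (hypothetical header names, not taken from the Swift bug) of how a framing-relevant header such as Content-Length can be silently dropped when it happens to follow a non-ASCII header name, so that two parsers can disagree about where the message ends:

    import io
    import http.client

    raw = (
        b'X-Whatever: ok\r\n'
        b'b\xc3\xa4d-name: ignored\r\n'
        b'Content-Length: 4\r\n'
        b'\r\n'
        b'AAAABBBB'
    )
    msg = http.client.parse_headers(io.BytesIO(raw))
    # py3 drops Content-Length along with every header after the bad name;
    # a parser that kept it would frame the body as exactly 4 bytes.
    print(msg.get('Content-Length'))
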

For what it's worth, most HTTP server libraries (at least the ones I tested; take it with a grain of salt) seem to implement their own header parsing. Eventlet was a notable exception [1].

[0] https://github.com/python/cpython/blob/v3.8.0/Lib/http/server.py#L336-L337
[1] https://github.com/eventlet/eventlet/pull/574
History
2019-12-23 20:35:39  tburke  set  messages: + msg358833
2019-06-03 22:20:59  tburke  set  keywords: + patch; stage: test needed -> patch review; pull_requests: + pull_request13672
2019-05-30 07:09:27  SilentGhost  set  versions: - Python 3.5, Python 3.6; nosy: + barry, r.david.murray, maxking; components: + Library (Lib); type: behavior; stage: test needed
2019-05-29 23:32:09  tburke  create