Issue 27716: http.client truncates UTF-8 encoded headers

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/71903

classification

Title:	http.client truncates UTF-8 encoded headers
Type:		Stage:
Components:	Library (Lib)	Versions:	Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:	http.client splits headers on non-\r\n characters View: 22233
Assigned To:		Nosy List:	Lukasa, martin.panter, r.david.murray
Priority:	normal	Keywords:	patch

Created on 2016-08-09 11:33 by Lukasa, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description	Edit
header-decoding.patch	martin.panter, 2016-09-18 01:57		review

Messages (7)
msg272236 - (view)	Author: Cory Benfield (Lukasa) *	Date: 2016-08-09 11:33
Originally reported as Requests issue #3485: https://github.com/kennethreitz/requests/issues/3485

On Python 3, http.client uses the email module to parse its HTTP headers. The email module, for better or worse, requires that it parse headers as text: that is, that they be decoded from bytes first and then parsed. This doesn't work for UTF-8 encoded headers.

For example, the URL `'http://pl.bab.la/slownik/angielski-polski/'` returns the following Link header, encoded as UTF-8:

`Link: <http://www.babla.cn/英语-波兰语/>; rel="alternate"; hreflang="zh-Hans", <http://cs.bab.la/slovnik/anglicky-polsky/>; rel="alternate"; hreflang="cs", <http://da.bab.la/ordbog/engelsk-polsk/>; rel="alternate"; hreflang="da", <http://de.bab.la/woerterbuch/englisch-polnisch/>; rel="alternate"; hreflang="de", <http://www.babla.gr/αγγλικα-πολωνικα/>; rel="alternate"; hreflang="el", <http://en.bab.la/dictionary/english-polish/>; rel="alternate"; hreflang="en", <http://eo.bab.la/vortaro/angla-pola/>; rel="alternate"; hreflang="eo", <http://es.bab.la/diccionario/ingles-polaco/>; rel="alternate"; hreflang="es", <http://fi.bab.la/sanakirja/englanti-puola/>; rel="alternate"; hreflang="fi", <http://fr.bab.la/dictionnaire/anglais-polonais/>; rel="alternate"; hreflang="fr", <http://www.babla.in/अंग्रेज़ी-पोलिश/>; rel="alternate"; hreflang="hi", <http://hu.bab.la/szótár/angol-lengyel/>; rel="alternate"; hreflang="hu", <http://www.babla.co.id/bahasa-inggris-bahasa-polandia/>; rel="alternate"; hreflang="id", <http://it.bab.la/dizionario/inglese-polacco/>; rel="alternate"; hreflang="it", <http://ja.bab.la/辞書/英語-ポーランド語/>; rel="alternate"; hreflang="ja", <http://www.babla.kr/영어-폴란드어/>; rel="alternate"; hreflang="ko", <http://nl.bab.la/woordenboek/engels-pools/>; rel="alternate"; hreflang="nl", <http://www.babla.no/engelsk-polsk/>; rel="alternate"; hreflang="no", <http://pl.bab.la/slownik/angielski-polski/>; rel="alternate"; hreflang="pl", <http://pt.bab.la/dicionario/ingles-polones/>; rel="alternate"; hreflang="pt", <http://ro.bab.la/dictionar/engleza-poloneza/>; rel="alternate"; hreflang="ro", <http://www.babla.ru/английский-польский/>; rel="alternate"; hreflang="ru", <http://sv.bab.la/lexikon/engelsk-polsk/>; rel="alternate"; hreflang="sv", <http://sw.bab.la/kamusi/kiingereza-kipolishi/>; rel="alternate"; hreflang="sw", <http://www.babla.co.th/english-polish/>; rel="alternate"; hreflang="th", <http://tr.bab.la/sozluk/ingilizce-lehce/>; rel="alternate"; hreflang="tr", <http://www.babla.vn/tieng-anh-tieng-ba-lan/>; rel="alternate"; hreflang="vi"`

When decoded using ISO-8859-1, this header gets truncated and this also causes the header block parsing to stop. This means that we don't see the Content-Length header, causing the HTTP client to wait for connection closure to consider the body terminated.

Really the only correct fix for this is for http.client to stop insisting that the headers be decoded before they are parsed, and instead to decode after. That way, at least, user code can recover the headers and handle them more sensibly.
msg272237 - (view)	Author: Cory Benfield (Lukasa) *	Date: 2016-08-09 11:35
Simple repro case:

    import http.client

    conn = http.client.HTTPConnection('pl.bab.la')
    conn.request("GET", '/slownik/angielski-polski/')
    resp = conn.getresponse()
    resp.read()  # Hangs here
msg272246 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-08-09 13:35
utf-8 headers are contrary to the http spec, aren't they? Or has that changed? (It's been a long time since I've looked at any http RFCs.) This could be fixed by using SMTPUTF8 mode when parsing the headers, which in theory ought to be backward compatible. We could make SMTPUTF8 the default for email.policy.http, if this is correct per the RFCs.
msg272250 - (view)	Author: Cory Benfield (Lukasa) *	Date: 2016-08-09 13:49
Honestly, David, everything's a mess on this front.

The authoritative document here is RFC 7230 Section 3.2.4 (https://tools.ietf.org/html/rfc7230#section-3.2.4). The last paragraph of that section reads:

    Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

In the case of http.client, this actually maps pretty closely to Python 3's bytes object: header field values are basically ASCII + arbitrary opaque bytes. While UTF-8 is not strictly called out as allowed, neither is it called out as forbidden.

In this case, I'd say that there's no need to be too pedantic about Latin 1 at this stage in the pipeline. Python 3 is welcome to decode using Latin 1 after the header block has been split, because at least then it can be fixed up due to the round-tripping nature of Latin 1. But doing it here seems to just confuse the email parser.
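
The "split first, decode after" ordering described above could look roughly like the sketch below. It is only an illustration, not what http.client does: `split_then_decode` is a hypothetical helper, the sample header block is shortened from the one in this report, and obsolete line folding is ignored.

```python
def split_then_decode(header_block):
    """Split a raw header block on CRLF first, then decode each field as Latin-1.

    Hypothetical helper for illustration; it ignores obsolete line folding.
    """
    fields = []
    for line in header_block.split(b"\r\n"):
        if not line:
            break  # an empty line ends the header block
        name, _, value = line.partition(b":")
        # bytes.strip() only removes ASCII whitespace, so the 0x85 byte is left alone.
        fields.append((name.decode("latin-1"), value.strip().decode("latin-1")))
    return fields


# Shortened sample block (assumed for the example), using bytes from the Link header.
raw = (
    b"Link: <http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-\xe6\xb3\xa2\xe5\x85\xb0\xe8\xaf\xad/>\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
)
fields = split_then_decode(raw)
link_value = fields[0][1]
# Latin-1 maps every byte to exactly one code point, so encoding restores the original bytes.
original_bytes = link_value.encode("latin-1")
```

Because the split happens on bytes, the Content-Length field after the non-ASCII Link field is still found, and the Latin-1 string can be turned back into the bytes the server sent.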
msg272254 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-08-09 14:11
Well, email will happily parse bytes and treat the non-ascii data as opaque (though it does record errors in an internal data structure), but the python3 http api expects the parsed headers to be strings when you access them, so you'd just hit the decoding problem at that point rather than earlier.

This is a hard problem. Since headers can be latin1 (I'd forgotten that) SMTPUTF8 won't work. We are stuck against the problem that python makes a careful distinction between bytes and string, but http does not.

In theory we could pass bytes to email, and then provide a new API for getting at the "raw" (bytes) header so you can decode it however you want. That runs into backward compatibility problems, though, since we currently do decode from latin-1 and many programs are probably relying on that.

Throwing out an idea here: maybe having the http policy decode the parsed bytes header from latin-1 when headers are accessed through the normal API would preserve backward compatibility. I'm not too worried about back-compat in the http policy, since it is provisional until 3.6 comes out and I doubt anyone is currently using it.
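
As a rough illustration of email parsing bytes directly, the sketch below feeds a shortened header block (an assumption for the example, not the full server response) to `email.parser.BytesHeaderParser` with the default policy. It only checks that the field after the non-ASCII one is still found; how the non-ASCII value itself is exposed depends on the policy and is deliberately not shown.

```python
from email.parser import BytesHeaderParser

# Shortened sample block (assumed for the example), using bytes from the Link header.
raw = (
    b"Link: <http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-\xe6\xb3\xa2\xe5\x85\xb0\xe8\xaf\xad/>\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
)

# The bytes parser does not decode the input as Latin-1, so the 0x85 byte
# never turns into a U+0085 character that could be mistaken for a line break.
msg = BytesHeaderParser().parsebytes(raw)
header_names = list(msg.keys())
content_length = msg["Content-Length"]
```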
msg272288 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-08-10 02:27
For the test case given, the main problem is actually that a header field is being incorrectly split on a Latin-1 “next line” control code U+0085. The problem is already described under Issue 22233. It looks like I wrote a patch for that a while ago, so it would be good to revisit and see if it is worth applying. Also, the problem would have been less severe if Issue 24363 was addressed; I proposed a patch at Issue 26686 which may help.

Here are the relevant header fields returned by the server:

>>> conn.request("GET", "/slownik/angielski-polski/")
>>> pprint(conn.sock.recv(3333).splitlines(keepends=True))
[b'HTTP/1.1 200 OK\r\n',
 . . .
 b'Link: <http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-\xe6\xb3\xa2\xe5\x85\xb0'
 b'\xe8\xaf\xad/>; rel="alternate"; hreflang="zh-Hans", '
 . . .
 b'Transfer-Encoding: chunked\r\n',
 b'Content-Type: text/html;charset=UTF-8\r\n',
 b'\r\n',
 b'104c\r\n',
 b'<!DOCTYPE html>\n',
 . . .]

Regarding header value character encoding, revision cb09fdef19f5 is an example of where I assumed a Latin-1 transformation to handle non-ASCII redirect targets. Perhaps just document how the bytes are transformed, and how to get the original bytes back? FWIW UTF-8 is used in RTSP, which is based on HTTP.
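
The split on U+0085 can be reproduced without a network connection. The sketch below uses a shortened header block (an assumption for the example) containing the same `\xe5\x85\xb0` byte sequence, and compares line splitting on bytes with line splitting on the Latin-1 decoded text.

```python
# Shortened sample block (assumed for the example); b"\xe5\x85\xb0" is one UTF-8
# encoded character from the Link header, and its middle byte is 0x85.
header_block = (
    b"Link: <http://www.babla.cn/\xe6\xb3\xa2\xe5\x85\xb0\xe8\xaf\xad/>\r\n"
    b"Content-Length: 5\r\n"
)

# bytes.splitlines() only recognises \n, \r and \r\n.
lines_from_bytes = header_block.splitlines()

# str.splitlines() also treats U+0085 (NEXT LINE) as a line boundary, so the
# Latin-1 decoded Link field is cut in two at the 0x85 byte.
lines_from_text = header_block.decode("latin-1").splitlines()

print(len(lines_from_bytes))  # 2
print(len(lines_from_text))   # 3
```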
msg276867 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-09-18 01:57
Thanks to the fix for Issue 22233, now the response is parsed more sensibly, and the body can be read. The 0x85 byte now gets decoded with Latin-1:

>>> print(ascii(resp.getheader("Link")[:100]))
'<http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-\xe6\xb3\xa2\xe5\x85\xb0\xe8\xaf\xad/>; rel="alternate"; hreflang="zh-Hans", <http://cs.bab.la/slov'

Here is a patch to document how to get the original bytes back (by “encoding” to Latin-1). Other than that, I don’t think there is much left to do for this bug.
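
In user code, getting the original bytes back by "encoding" to Latin-1 might look like the sketch below. `recover_utf8` is a hypothetical helper, not part of http.client, and it assumes the value came from Latin-1 decoding (so every character is at most U+00FF); the sample value is the start of the string printed above.

```python
def recover_utf8(header_value):
    """Undo the Latin-1 decoding and try UTF-8 instead (hypothetical helper)."""
    raw = header_value.encode("latin-1")  # restores the bytes sent by the server
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so keep the Latin-1 interpretation.
        return header_value


# Start of the Link value shown above, as returned by resp.getheader("Link").
value = (
    '<http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-\xe6\xb3\xa2\xe5\x85\xb0\xe8\xaf\xad/>'
    '; rel="alternate"; hreflang="zh-Hans"'
)
recovered = recover_utf8(value)
```

The fallback matters because a server may really have sent Latin-1, in which case the bytes are usually not valid UTF-8 and the original string is returned unchanged.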

History
Date	User	Action	Args
2022-04-11 14:58:34	admin	set	github: 71903
2016-09-18 01:57:01	martin.panter	set	files: + header-decoding.patch keywords: + patch messages: + msg276867
2016-08-10 02:27:38	martin.panter	set	superseder: http.client splits headers on non-\r\n characters messages: + msg272288 nosy: + martin.panter
2016-08-09 14:11:38	r.david.murray	set	messages: + msg272254
2016-08-09 13:49:52	Lukasa	set	messages: + msg272250
2016-08-09 13:35:11	r.david.murray	set	nosy: + r.david.murray messages: + msg272246
2016-08-09 11:35:30	Lukasa	set	messages: + msg272237
2016-08-09 11:33:16	Lukasa	create