classification
Title: http.client truncates UTF-8 encoded headers
Type: Stage:
Components: Library (Lib) Versions: Python 3.5
process
Status: open Resolution:
Dependencies: Superseder: http.client splits headers on non-\r\n characters
View: 22233
Assigned To: Nosy List: Lukasa, martin.panter, r.david.murray
Priority: normal Keywords: patch

Created on 2016-08-09 11:33 by Lukasa, last changed 2016-09-18 01:57 by martin.panter.

Files
File name Uploaded Description Edit
header-decoding.patch martin.panter, 2016-09-18 01:57 review
Messages (7)
msg272236 - (view) Author: Cory Benfield (Lukasa) * Date: 2016-08-09 11:33
Originally reported as Requests issue #3485: https://github.com/kennethreitz/requests/issues/3485

On Python 3, http.client uses the email module to parse its HTTP headers. The email module, for better or worse, requires that it parse headers as *text*: that is, that they be decoded from bytes first and then parsed.

This doesn't work for UTF-8 encoded headers. For example, the URL `'http://pl.bab.la/slownik/angielski-polski/'` returns the following Link header, encoded as UTF-8: `Link: <http://www.babla.cn/英语-波兰语/>; rel="alternate"; hreflang="zh-Hans", <http://cs.bab.la/slovnik/anglicky-polsky/>; rel="alternate"; hreflang="cs", <http://da.bab.la/ordbog/engelsk-polsk/>; rel="alternate"; hreflang="da", <http://de.bab.la/woerterbuch/englisch-polnisch/>; rel="alternate"; hreflang="de", <http://www.babla.gr/αγγλικα-πολωνικα/>; rel="alternate"; hreflang="el", <http://en.bab.la/dictionary/english-polish/>; rel="alternate"; hreflang="en", <http://eo.bab.la/vortaro/angla-pola/>; rel="alternate"; hreflang="eo", <http://es.bab.la/diccionario/ingles-polaco/>; rel="alternate"; hreflang="es", <http://fi.bab.la/sanakirja/englanti-puola/>; rel="alternate"; hreflang="fi", <http://fr.bab.la/dictionnaire/anglais-polonais/>; rel="alternate"; hreflang="fr", <http://www.babla.in/अंग्रेज़ी-पोलिश/>; rel="alternate"; hreflang="hi", <http://hu.bab.la/szótár/angol-lengyel/>; rel="alternate"; hreflang="hu", <http://www.babla.co.id/bahasa-inggris-bahasa-polandia/>; rel="alternate"; hreflang="id", <http://it.bab.la/dizionario/inglese-polacco/>; rel="alternate"; hreflang="it", <http://ja.bab.la/辞書/英語-ポーランド語/>; rel="alternate"; hreflang="ja", <http://www.babla.kr/영어-폴란드어/>; rel="alternate"; hreflang="ko", <http://nl.bab.la/woordenboek/engels-pools/>; rel="alternate"; hreflang="nl", <http://www.babla.no/engelsk-polsk/>; rel="alternate"; hreflang="no", <http://pl.bab.la/slownik/angielski-polski/>; rel="alternate"; hreflang="pl", <http://pt.bab.la/dicionario/ingles-polones/>; rel="alternate"; hreflang="pt", <http://ro.bab.la/dictionar/engleza-poloneza/>; rel="alternate"; hreflang="ro", <http://www.babla.ru/английский-польский/>; rel="alternate"; hreflang="ru", <http://sv.bab.la/lexikon/engelsk-polsk/>; rel="alternate"; hreflang="sv", <http://sw.bab.la/kamusi/kiingereza-kipolishi/>; rel="alternate"; 
hreflang="sw", <http://www.babla.co.th/english-polish/>; rel="alternate"; hreflang="th", <http://tr.bab.la/sozluk/ingilizce-lehce/>; rel="alternate"; hreflang="tr", <http://www.babla.vn/tieng-anh-tieng-ba-lan/>; rel="alternate"; hreflang="vi"`.

When decoded using ISO-8859-1, this header gets truncated and this also causes the header block parsing to stop. This means that we don't see the Content-Length header, causing the HTTP client to wait for connection closure to consider the body terminated.

Really the only correct fix for this is for http.client to stop insisting that the headers be decoded before they are parsed, and instead to decode *after*. That way, at least, user code can recover the headers and handle them more sensibly.
msg272237 - (view) Author: Cory Benfield (Lukasa) * Date: 2016-08-09 11:35
Simple repro case:

    import http.client
    conn = http.client.HTTPConnection('pl.bab.la')
    conn.request("GET", '/slownik/angielski-polski/')
    resp = conn.getresponse()
    resp.read()  # Hangs here
msg272246 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-08-09 13:35
utf-8 headers are contrary to the http spec, aren't they?  Or has that changed?  (It's been a long time since I've looked at any http RFCs.)

This could be fixed by using SMTPUTF8 mode when parsing the headers, which in theory ought to be backward compatible.  We could make SMTPUTF8 the default for email.policy.http, if this is correct per the RFCs.
msg272250 - (view) Author: Cory Benfield (Lukasa) * Date: 2016-08-09 13:49
Honestly, David, everything's a mess on this front. The authoritative document here is RFC 7230 Section 3.2.4 (https://tools.ietf.org/html/rfc7230#section-3.2.4). The last paragraph of that section reads:

   Historically, HTTP has allowed field content with text in the
   ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
   through use of [RFC2047] encoding.  In practice, most HTTP header
   field values use only a subset of the US-ASCII charset [USASCII].
   Newly defined header fields SHOULD limit their field values to
   US-ASCII octets.  A recipient SHOULD treat other octets in field
   content (obs-text) as opaque data.

In the case of http.client, this actually maps pretty closely to Python 3's bytes object: header field values are basically ASCII plus arbitrary opaque bytes. While UTF-8 is not strictly called out as allowed, neither is it called out as forbidden.

In this case, I'd say that there's no need to be too pedantic about Latin 1 at this stage in the pipeline. Python 3 is welcome to decode using Latin 1 *after* the header block has been split, because at least then it can be fixed up due to the round-tripping nature of Latin 1. But doing it here seems to just confuse the email parser.
msg272254 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-08-09 14:11
Well, email will happily parse bytes and treat the non-ascii data as opaque (though it does record errors in an internal data structure), but the python3 http api expects the parsed headers to be strings when you access them, so you'd just hit the decoding problem at that point rather than earlier.
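A rough sketch of the bytes-parsing behaviour described above. It assumes a made-up two-field header block rather than the real server response, and uses the stdlib `email.parser.BytesParser` with its default policy. The field that follows the non-ASCII one is still found:

```python
from email.parser import BytesParser

# Hypothetical header block: a UTF-8 encoded Link value followed by
# a Content-Length field. Not the actual response from the server.
raw_headers = (
    b'Link: <http://www.babla.cn/\xe6\xb3\xa2\xe5\x85\xb0/>; rel="alternate"\r\n'
    b'Content-Length: 5\r\n'
    b'\r\n'
)

# Parse the bytes directly, without decoding from ISO-8859-1 first.
msg = BytesParser().parsebytes(raw_headers, headersonly=True)

content_length = msg['Content-Length']  # plain ASCII value, returned as str
field_names = msg.keys()                # both fields were recognised
```

How the non-ASCII Link value is presented on access depends on the email policy in use, so it is deliberately not shown here.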

This is a hard problem. Since headers *can* be latin1 (I'd forgotten that) SMTPUTF8 won't work.  We are stuck against the problem that python makes a careful distinction between bytes and string, but http does not.

In theory we could pass bytes to email, and then provide a new API for getting at the "raw" (bytes) header so you can decode it however you want.  That runs into backward compatibility problems, though, since we currently do decode from latin-1 and many programs are probably relying on that.  

Throwing out an idea here: maybe having the http policy decode the parsed bytes header from latin-1 when headers are accessed through the normal API would preserve backward compatibility.  I'm not too worried about back-compat in the http policy, since it is provisional until 3.6 comes out and I doubt anyone is currently using it.
msg272288 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-08-10 02:27
For the test case given, the main problem is actually that a header field is being incorrectly split on a Latin-1 “next line” control code U+0085. The problem is already described under Issue 22233. It looks like I wrote a patch for that a while ago, so it would be good to revisit and see if it is worth applying.
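The splitting can be illustrated in isolation. This is only a sketch of the mechanism, not the actual http.client or email code path: it takes a shortened copy of the Link field and compares `bytes.splitlines()` with `str.splitlines()` after an ISO-8859-1 decode.

```python
# Shortened copy of the Link field; the UTF-8 encoding of one of the
# Chinese characters contains the byte 0x85.
raw = (b'Link: <http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-'
       b'\xe6\xb3\xa2\xe5\x85\xb0\xe8\xaf\xad/>; rel="alternate"\r\n')

# bytes.splitlines() only breaks on \n, \r and \r\n.
byte_lines = raw.splitlines(keepends=True)

# After decoding as ISO-8859-1, byte 0x85 becomes U+0085 (NEXT LINE),
# which str.splitlines() treats as a line boundary.
text_lines = raw.decode('iso-8859-1').splitlines(keepends=True)

print(len(byte_lines), len(text_lines))
```

The decoded text comes back as two "lines", with the first one ending at U+0085, so a parser that splits this way sees a truncated field followed by a line that does not look like a header.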

Also, the problem would have been less severe if Issue 24363 was addressed; I proposed a patch at Issue 26686 which may help.

Here are the relevant header fields returned by the server:
>>> conn.request("GET", "/slownik/angielski-polski/")
>>> pprint(conn.sock.recv(3333).splitlines(keepends=True))
[b'HTTP/1.1 200 OK\r\n',
 . . .
 b'Link: <http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-\xe6\xb3\xa2\xe5\x85\xb0'
 b'\xe8\xaf\xad/>; rel="alternate"; hreflang="zh-Hans", '
 . . .
 b'Transfer-Encoding: chunked\r\n',
 b'Content-Type: text/html;charset=UTF-8\r\n',
 b'\r\n',
 b'104c\r\n',
 b'<!DOCTYPE html>\n',
 . . .]

Regarding header value character encoding, revision cb09fdef19f5 is an example of where I assumed a Latin-1 transformation to handle non-ASCII redirect targets. Perhaps just document how the bytes are transformed, and how to get the original bytes back?

FWIW UTF-8 is used in RTSP, which is based on HTTP.
msg276867 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-09-18 01:57
Thanks to the fix for Issue 22233, now the response is parsed more sensibly, and the body can be read. The 0x85 byte now gets decoded with Latin-1:

>>> print(ascii(resp.getheader("Link")[:100]))
'<http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-\xe6\xb3\xa2\xe5\x85\xb0\xe8\xaf\xad/>; rel="alternate"; hreflang="zh-Hans", <http://cs.bab.la/slov'

Here is a patch to document how to get the original bytes back (by “encoding” to Latin-1). Other than that, I don’t think there is much left to do for this bug.
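A minimal sketch of that round trip, assuming the server really did send UTF-8. The helper name `recover_utf8` is made up for illustration and is not part of http.client; the sample value is a shortened copy of the string shown above.

```python
def recover_utf8(header_value):
    """Undo the Latin-1 decoding and reinterpret the bytes as UTF-8.

    Hypothetical helper: raises UnicodeDecodeError if the original
    bytes were not valid UTF-8.
    """
    original_bytes = header_value.encode('iso-8859-1')
    return original_bytes.decode('utf-8')

# Shortened copy of the value returned by resp.getheader("Link").
latin1_value = ('<http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-'
                '\xe6\xb3\xa2\xe5\x85\xb0\xe8\xaf\xad/>; rel="alternate"')

recovered = recover_utf8(latin1_value)
print(recovered)
```

Because every byte value maps to exactly one code point in ISO-8859-1, the encode step is lossless; only the final decode makes an assumption about what the server intended.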
History
Date User Action Args
2016-09-18 01:57:01 martin.panter set files: + header-decoding.patch
keywords: + patch
messages: + msg276867
2016-08-10 02:27:38 martin.panter set superseder: http.client splits headers on non-\r\n characters
messages: + msg272288
nosy: + martin.panter
2016-08-09 14:11:38 r.david.murray set messages: + msg272254
2016-08-09 13:49:52 Lukasa set messages: + msg272250
2016-08-09 13:35:11 r.david.murray set nosy: + r.david.murray
messages: + msg272246
2016-08-09 11:35:30 Lukasa set messages: + msg272237
2016-08-09 11:33:16 Lukasa create