Message272236
Originally reported as Requests issue #3485: https://github.com/kennethreitz/requests/issues/3485
On Python 3, http.client uses the email module to parse its HTTP headers. The email module, for better or worse, requires that it parse headers as *text*: that is, that they be decoded from bytes first and then parsed.
This doesn't work for UTF-8 encoded headers. For example, the URL `'http://pl.bab.la/slownik/angielski-polski/'` returns the following Link header, encoded as UTF-8: `Link: <http://www.babla.cn/英语-波兰语/>; rel="alternate"; hreflang="zh-Hans", <http://cs.bab.la/slovnik/anglicky-polsky/>; rel="alternate"; hreflang="cs", <http://da.bab.la/ordbog/engelsk-polsk/>; rel="alternate"; hreflang="da", <http://de.bab.la/woerterbuch/englisch-polnisch/>; rel="alternate"; hreflang="de", <http://www.babla.gr/αγγλικα-πολωνικα/>; rel="alternate"; hreflang="el", <http://en.bab.la/dictionary/english-polish/>; rel="alternate"; hreflang="en", <http://eo.bab.la/vortaro/angla-pola/>; rel="alternate"; hreflang="eo", <http://es.bab.la/diccionario/ingles-polaco/>; rel="alternate"; hreflang="es", <http://fi.bab.la/sanakirja/englanti-puola/>; rel="alternate"; hreflang="fi", <http://fr.bab.la/dictionnaire/anglais-polonais/>; rel="alternate"; hreflang="fr", <http://www.babla.in/अंग्रेज़ी-पोलिश/>; rel="alternate"; hreflang="hi", <http://hu.bab.la/szótár/angol-lengyel/>; rel="alternate"; hreflang="hu", <http://www.babla.co.id/bahasa-inggris-bahasa-polandia/>; rel="alternate"; hreflang="id", <http://it.bab.la/dizionario/inglese-polacco/>; rel="alternate"; hreflang="it", <http://ja.bab.la/辞書/英語-ポーランド語/>; rel="alternate"; hreflang="ja", <http://www.babla.kr/영어-폴란드어/>; rel="alternate"; hreflang="ko", <http://nl.bab.la/woordenboek/engels-pools/>; rel="alternate"; hreflang="nl", <http://www.babla.no/engelsk-polsk/>; rel="alternate"; hreflang="no", <http://pl.bab.la/slownik/angielski-polski/>; rel="alternate"; hreflang="pl", <http://pt.bab.la/dicionario/ingles-polones/>; rel="alternate"; hreflang="pt", <http://ro.bab.la/dictionar/engleza-poloneza/>; rel="alternate"; hreflang="ro", <http://www.babla.ru/английский-польский/>; rel="alternate"; hreflang="ru", <http://sv.bab.la/lexikon/engelsk-polsk/>; rel="alternate"; hreflang="sv", <http://sw.bab.la/kamusi/kiingereza-kipolishi/>; rel="alternate"; hreflang="sw", <http://www.babla.co.th/english-polish/>; rel="alternate"; hreflang="th", <http://tr.bab.la/sozluk/ingilizce-lehce/>; rel="alternate"; hreflang="tr", <http://www.babla.vn/tieng-anh-tieng-ba-lan/>; rel="alternate"; hreflang="vi"`.
When decoded using ISO-8859-1, this header gets truncated and this also causes the header block parsing to stop. This means that we don't see the Content-Length header, causing the HTTP client to wait for connection closure to consider the body terminated.
Really the only correct fix for this is for http.client to stop insisting that the headers be decoded before they are parsed, and instead to decode *after*. That way, at least, user code can recover the headers and handle them more sensibly. |
|
Date |
User |
Action |
Args |
2016-08-09 11:33:16 | Lukasa | set | recipients:
+ Lukasa |
2016-08-09 11:33:16 | Lukasa | set | messageid: <1470742396.79.0.719108941731.issue27716@psf.upfronthosting.co.za> |
2016-08-09 11:33:16 | Lukasa | link | issue27716 messages |
2016-08-09 11:33:15 | Lukasa | create | |
|