Message 272288 - Python tracker

Author	martin.panter
Recipients	Lukasa, martin.panter, r.david.murray
Date	2016-08-10.02:27:37
Message-id	<1470796058.56.0.582916845474.issue27716@psf.upfronthosting.co.za>

Content
For the test case given, the main problem is actually that a header field is being incorrectly split on a Latin-1 “next line” control code U+0085. The problem is already described under Issue 22233. It looks like I wrote a patch for that a while ago, so it would be good to revisit and see if it is worth applying.
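A minimal sketch of the splitting behaviour, using bytes copied from the Link header quoted in this message. It assumes (as an illustration, not a trace of the actual http.client code path) that the header bytes are decoded as Latin-1 and then split with str.splitlines():

```python
# Bytes copied from the Link header in this message; they are valid UTF-8.
raw = b'\xe6\xb3\xa2\xe5\x85\xb0\xe8\xaf\xad'

# Decoding as Latin-1 maps the byte 0x85 to U+0085 ("next line").
text = raw.decode('latin-1')

# bytes.splitlines() only breaks on \n, \r and \r\n, so the value stays whole.
bytes_lines = raw.splitlines()

# str.splitlines() also treats U+0085 as a line boundary, so the value is cut
# in the middle of what was a single multi-byte UTF-8 character.
str_lines = text.splitlines()

print(len(bytes_lines))  # 1
print(len(str_lines))    # 2
```

Splitting the text form on "\r\n" (or "\n") explicitly, rather than calling str.splitlines(), would avoid the extra boundary in this sketch.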

Also, the problem would have been less severe if Issue 24363 had been addressed; I proposed a patch at Issue 26686, which may help.

Here are the relevant header fields returned by the server:
>>> conn.request("GET", "/slownik/angielski-polski/")
>>> pprint(conn.sock.recv(3333).splitlines(keepends=True))
[b'HTTP/1.1 200 OK\r\n',
 . . .
 b'Link: <http://www.babla.cn/\xe8\x8b\xb1\xe8\xaf\xad-\xe6\xb3\xa2\xe5\x85\xb0'
 b'\xe8\xaf\xad/>; rel="alternate"; hreflang="zh-Hans", '
 . . .
 b'Transfer-Encoding: chunked\r\n',
 b'Content-Type: text/html;charset=UTF-8\r\n',
 b'\r\n',
 b'104c\r\n',
 b'<!DOCTYPE html>\n',
 . . .]

Regarding header value character encoding, revision cb09fdef19f5 is an example of where I assumed a Latin-1 transformation to handle non-ASCII redirect targets. Perhaps just document how the bytes are transformed, and how to get the original bytes back?
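As a rough illustration of what such documentation could show, assuming header values reach the caller as str produced by a Latin-1 decode: encoding with Latin-1 again returns the original bytes, which can then be decoded with whatever encoding the server really used. The helper name recover_header_bytes is hypothetical, not an existing API.

```python
def recover_header_bytes(value: str) -> bytes:
    """Undo a Latin-1 decode (hypothetical helper, not part of http.client)."""
    # Latin-1 maps code points 0-255 one-to-one onto byte values 0-255,
    # so this round trip is lossless for any str that came from bytes.
    return value.encode('latin-1')


# Part of the redirect-style URL bytes quoted in this message (UTF-8).
original = b'\xe8\x8b\xb1\xe8\xaf\xad'

# What the caller would see under the Latin-1 assumption.
as_str = original.decode('latin-1')

# Recover the bytes, then decode with the encoding actually intended.
recovered = recover_header_bytes(as_str)
decoded = recovered.decode('utf-8')

print(recovered == original)  # True
```

The second decode is a guess about the server's intent; only the byte recovery step is guaranteed under the stated assumption.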

FWIW UTF-8 is used in RTSP, which is based on HTTP.
