Issue 33973: HTTP request-line parsing splits on Unicode whitespace

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/78154

classification

Title:	HTTP request-line parsing splits on Unicode whitespace
Type:	behavior	Stage:	patch review
Components:	Library (Lib), Unicode	Versions:	Python 3.9, Python 3.8, Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, tburke, vstinner
Priority:	normal	Keywords:	patch

Created on 2018-06-26 18:39 by tburke, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 7932	open	tburke, 2018-06-26 18:42

Messages (2)
msg320507	Author: Tim Burke (tburke) *	Date: 2018-06-26 18:39
This causes (admittedly, buggy) clients that would work with a Python 2 server to stop working when the server upgrades to Python 3. To demonstrate, run `python2.7 -m SimpleHTTPServer 8027` in one terminal and `curl -v http://127.0.0.1:8027/你好` in another -- curl reports

    * Trying 127.0.0.1...
    * TCP_NODELAY set
    * Connected to 127.0.0.1 (127.0.0.1) port 8027 (#0)
    > GET /你好 HTTP/1.1
    > Host: 127.0.0.1:8027
    > User-Agent: curl/7.54.0
    > Accept: */*
    >
    * HTTP 1.0, assume close after body
    < HTTP/1.0 404 File not found
    < Server: SimpleHTTP/0.6 Python/2.7.10
    < Date: Tue, 26 Jun 2018 17:23:25 GMT
    < Content-Type: text/html
    < Connection: close
    <
    <head>
    <title>Error response</title>
    </head>
    <body>
    <h1>Error response</h1>
    <p>Error code 404.
    <p>Message: File not found.
    <p>Error code explanation: 404 = Nothing matches the given URI.
    </body>
    * Closing connection 0

...while repeating the experiment with `python3.6 -m http.server 8036` and `curl -v http://127.0.0.1:8036/你好` gives

    * Trying 127.0.0.1...
    * TCP_NODELAY set
    * Connected to 127.0.0.1 (127.0.0.1) port 8036 (#0)
    > GET /你好 HTTP/1.1
    > Host: 127.0.0.1:8036
    > User-Agent: curl/7.54.0
    > Accept: */*
    >
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
    <title>Error response</title>
    </head>
    <body>
    <h1>Error response</h1>
    <p>Error code: 400</p>
    <p>Message: Bad request syntax ('GET /ä½\xa0å¥½ HTTP/1.1').</p>
    <p>Error code explanation: HTTPStatus.BAD_REQUEST - Bad request syntax or unsupported method.</p>
    </body>
    </html>
    * Connection #0 to host 127.0.0.1 left intact

Granted, a well-behaved client would have quoted the UTF-8 '你好' as '%E4%BD%A0%E5%A5%BD' (in which case everything would have behaved as expected), but RFC 7230 is pretty clear that the request-line should be SP-delimited. While it notes that "recipients MAY instead parse on whitespace-delimited word boundaries and, aside from the CRLF terminator, treat any form of whitespace as the SP separator", it goes on to say that "such whitespace includes one or more of the following octets: SP, HTAB, VT (%x0B), FF (%x0C), or bare CR" with no mention of characters like the (ISO-8859-1 encoded) non-breaking space that caused the 400 response.

FWIW, there was a similar unicode-separators-are-not-the-right-separators bug in header parsing a while back: https://bugs.python.org/issue22233
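
The 400 response can be reproduced without a server. The following is a minimal sketch, assuming the parser decodes the raw request line as ISO-8859-1 and then calls `str.split()` with no arguments; the variable names are illustrative and are not taken from `http.server`.

```python
# UTF-8 bytes of the request line curl sent (path left unquoted).
raw = "GET /你好 HTTP/1.1".encode("utf-8")

# Decode one character per byte, as an ISO-8859-1 based parser would.
# The third byte of '你' in UTF-8 is 0xA0, which becomes U+00A0 (NBSP).
line = raw.decode("iso-8859-1")

# str.split() with no argument splits on any Unicode whitespace,
# so the NBSP inside the path is treated as a separator.
unicode_words = line.split()

# Splitting on the SP character only keeps the path in one piece.
sp_words = line.split(" ")

print(len(unicode_words), unicode_words)
print(len(sp_words), sp_words)
```

With whitespace splitting the request line yields four words instead of three, which is consistent with the "Bad request syntax" message above; splitting on SP alone yields the expected method, target, and version.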
msg320529	Author: STINNER Victor (vstinner) *	Date: 2018-06-27 00:42
isspace() is also true for another non-ASCII character: U+0085 (b'\x85').

>>> ''.join(chr(i) for i in range(256) if chr(i).isspace())
'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0'
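
To make the gap concrete, here is a small sketch that compares the characters `str.isspace()` accepts below U+0100 with the whitespace octets RFC 7230 lists for lenient request-line parsing. The name `RFC7230_WS` is made up for this example, and the comparison with `bytes.isspace()` is only an illustration, not a description of what PR 7932 does.

```python
# Whitespace octets RFC 7230 lets a lenient parser treat as SP:
# SP, HTAB, VT, FF and bare CR. (Hypothetical name for this sketch.)
RFC7230_WS = {" ", "\t", "\x0b", "\x0c", "\r"}

# Every code point below 256 for which str.isspace() is true.
str_ws = {chr(i) for i in range(256) if chr(i).isspace()}

# Characters str.split() would split on that the RFC does not list.
extra = sorted(str_ws - RFC7230_WS)

# bytes.isspace() only recognises ASCII whitespace, so 0x85 and 0xA0
# are not treated as separators at the bytes level.
bytes_ws = sorted(i for i in range(256) if bytes([i]).isspace())

print([hex(ord(c)) for c in extra])
print([hex(i) for i in bytes_ws])
```

Besides LF (the line terminator), the extras are the four ASCII separator controls U+001C to U+001F plus the two non-ASCII characters U+0085 and U+00A0.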

History
Date	User	Action	Args
2022-04-11 14:59:02	admin	set	github: 78154
2020-01-25 13:18:38	cheryl.sabella	set	versions: + Python 3.9, - Python 3.4, Python 3.5, Python 3.6
2018-06-27 00:42:06	vstinner	set	messages: + msg320529
2018-06-26 18:42:06	tburke	set	keywords: + patch stage: patch review pull_requests: + pull_request7539
2018-06-26 18:39:29	tburke	create