classification
Title: http.server query string handling is incorrect and inefficient
Type: behavior Stage: needs patch
Components: Library (Lib) Versions: Python 3.6, Python 3.4, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eric.araujo, martin.panter, orsenthil, v+python
Priority: normal Keywords:

Created on 2012-04-12 19:53 by v+python, last changed 2015-10-02 03:24 by martin.panter.

Messages (5)
msg158166 - (view) Author: Glenn Linderman (v+python) * Date: 2012-04-12 19:53
A URL potentially consists of four parts: path, PATH_INFO, anchor, QUERY_STRING.  The syntax is roughly:

/path/parts/cgi-script/path/info/parts#anchor?query-string

where # and ? characters play key roles.

is_cgi not-so-cleverly passes the whole request to _url_collapse_path for normalization, resulting in inappropriately normalizing the anchor and query-string as well as appropriately normalizing the path and path info parts.  Consider that there is no syntax restrictions preventing a query string or anchor from containing / and characters. I contrived such strings, and observed that they were passed to _url_collapse_path and were inappropriately "normalized".

Now for non-CGI usage, both the anchor and query-string are ignored by server.py (see translate_path which has code to chop them off). For non-CGI usage, translate_path is called just once, so this isn't particularly inefficient nor is it incorrect. However, for CGI usage it can be both.

It is not clear to me why browsers even send the anchor to the server, but they do. Perhaps there are servers that process it, but this one doesn't, in any case I can find, and I'm unaware of semantics applied by any server.  Therefore, parsing and ignoring it seems appropriate.

Presently, inappropriate normalization of the query-string causes no correctness problem for CGI requests because of issue 14566 which reverts back to the original path.  Should issue 14566 be fixed as recommended, certain query-strings will be mangled by the inappropriate normalization. However, even though there is not presently a correctness problem, there is an efficiency problem for CGI requests: the anchor and query-string are both needlessly processed by _url_collapse_path, and potentially repeatedly processed by translate_path in the loop at the top of run_cgi. In fact, it is not exactly clear whether anchors and query-strings containing cleverly crafted / and . and .. combinations might be able to confuse that loop... because translate_path chops them off, but the loop does not!

It seems to me that the appropriate place to separate the anchor and query string would be in parse_request.  Instead of creating self.command, self.path, self.request_version there, I think it would be better to divide the current content of self.path into three parts: self.path, self.anchor, and self.query_string.

self.path, then, would unambiguously contain only path parts, and would be appropriately passed through _url_collapse_path... perhaps even right there in parse_request ... it seems appropriate to normalize the path for reqular files as well as CGI files for all the same reasons.

Reasons:
1) no ability to access files outside the configured realm due to malicious use of .. as a path component -- although it seems translate_path prevents this anyway, in a complex manner.
2) proper application of unquote, to allow # and ? characters to exist in path parts, and may also legally be applied to other characters by browser code

Neither of these actions are presently performed for regular files, only for CGI files.

If the processing happens in parse_request, then other code would be affected:

is_cgi would no longer need to call _url_collapse_path, as it would already be done

translate_path would no longer need to parse off query strings or anchors, as it would already be done

translate_path would no longer need to call urllib.parse.unquote, as it would already be done.

run_cgi would no longer need to parse anchor or query, but instead could reference query as self.query_string
msg158199 - (view) Author: Glenn Linderman (v+python) * Date: 2012-04-13 08:02
A bit of experimentation indicates that for regular file access, there probably is no security problem, but bad paths will look in weird places, and if they find a file of the right name, will return it. It would be much better to diagnose such paths.  There is even a comment to that effect in translate_path.
msg158202 - (view) Author: Glenn Linderman (v+python) * Date: 2012-04-13 08:58
I finally understand the purpose of the checks in translate path...
Basically, translate path is concatenating the URL path to the current directory (because that is considered the root for Web service by this
server).  But along the way, it does normalization (redundantly compared to _url_collapse_path, but for a different code path, and sadly, using a different algorithm that gets different results), and os-specific checks.

For the os-specific checks, it does a couple splits to see if each path component contains a drive letter or "character other than / that is used as a directory separator", or contains "." or ".." (os specific versions).

It doesn't check for os-specific illegal filename characters (but of course they will not match existing files on the OS, so that eventually would cause a 404).

Such checks are probably best done only on path components that are actually traversed, the only problem is that increasingly large subsets of the path are passed to translate_path by run_cgi so the net effect is an O(n-squared) performance characteristic: most actual paths do not get too long, happily, but it is still inefficient.

Factoring out the checks into a function that could be called by translate path or run_cgi might be appropriate, and then run_cgi could call the new function piece by piece instead of calling translate_path at all.  It would also be good to make translate path produce an error if "drive" or "head" are ever non-empty.
msg158217 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-04-13 17:42
> /path/parts/cgi-script/path/info/parts#anchor?query-string

This should be: /path/parts/cgi-script/path/info/parts?query-string#anchor
msg252080 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-02 03:24
I think the decision on how to parse the “path” attribute has to be left up to each request handler implementation, rather than being done blindly in BaseHTTPRequestHandler.parse_request(). The reason is there are various forms of HTTP request target that don’t actually have a path:

OPTIONS *  # Asterisk form
OPTIONS http://example.net  # Absolute form with no path nor query
CONNECT example.net:443  # Authority form

I agree that the current situation is far from ideal, and a function or method to parse a URL path should be very useful. Functions urlparse() and urlsplit() can already help with splitting off the query.

Currently there is double percent-decoding going on in the CGI server. “GET /cgi-bin/%2574est.py” decodes to “test.py”, when it should only decode as “%74est.py”. This is probably a side-effect of fixing Issue 14566 and related bugs.

Also, see Issue 5714 about making this function a public API.
History
Date User Action Args
2015-10-02 03:24:37martin.pantersetversions: + Python 3.4, Python 3.5, Python 3.6, - Python 3.2, Python 3.3
nosy: + martin.panter

messages: + msg252080

stage: needs patch
2012-04-16 17:51:45Jim.Jewettsettitle: http.server query string handling incorrect and inefficient -> http.server query string handling is incorrect and inefficient
2012-04-13 17:43:00eric.araujosettitle: http/server.py query string handling incorrect, inefficient -> http.server query string handling incorrect and inefficient
versions: - Python 2.6, Python 3.1
2012-04-13 17:42:44eric.araujosetnosy: + eric.araujo
messages: + msg158217
2012-04-13 08:58:24v+pythonsetmessages: + msg158202
2012-04-13 08:02:47v+pythonsetmessages: + msg158199
2012-04-12 19:53:06v+pythoncreate