classification
Title: http.server should support HTTP compression (gzip)
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: martin.panter, quentel, v+python
Priority: normal Keywords:

Created on 2017-06-05 19:42 by quentel, last changed 2017-06-20 12:36 by quentel.

Pull Requests
URL      Status  Linked
PR 2078  open    quentel, 2017-06-10 13:31
Messages (8)
msg295207 - (view) Author: Pierre Quentel (quentel) * Date: 2017-06-05 19:42
The server in http.server currently doesn't support HTTP compression.

I propose to implement it in the method send_head() of SimpleHTTPRequestHandler as follows: for each GET request, if the request header "Accept-Encoding" is present and includes "gzip" among the accepted compression schemes, and if the Content-Type determined by the file extension is in a list compressed_types, then the server sets the "Content-Encoding" response header to "gzip" and send_head() returns a file object with the gzipped content.

compressed_types is an attribute of the SimpleHTTPRequestHandler class and is set by default to ["text/plain", "text/html", "text/css", "text/xml", "text/javascript", "application/javascript", "application/json"].

The implementation is very simple (a few lines of code).
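The proposed logic can be sketched roughly as follows (a minimal illustration, not the actual PR code; maybe_gzip and its return convention are made up for this example — the real patch works inside send_head() and returns a file object rather than bytes):

```python
import gzip

# Default list of content types worth compressing, mirroring the
# proposed SimpleHTTPRequestHandler.compressed_types attribute.
COMPRESSED_TYPES = [
    "text/plain", "text/html", "text/css", "text/xml",
    "text/javascript", "application/javascript", "application/json",
]

def maybe_gzip(body, content_type, accept_encoding):
    """Return (body, headers), gzipping when the client accepts it."""
    headers = {"Content-Type": content_type}
    # Naive parse of "Accept-Encoding: gzip, deflate" (q-values ignored here).
    encodings = [e.split(";")[0].strip() for e in accept_encoding.split(",")]
    if "gzip" in encodings and content_type in COMPRESSED_TYPES:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return body, headers
```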

I also propose to modify mimetypes to add the mapping of extension ".json" to "application/json".
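For reference, the same mapping can already be registered at runtime through the public mimetypes API; the stdlib change would simply make it the default:

```python
import mimetypes

# Register the mapping the proposed stdlib change would make the default.
mimetypes.add_type("application/json", ".json")

guessed, _encoding = mimetypes.guess_type("data.json")
```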

I will make a Pull Request on the CPython Github site with these changes.
msg295218 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-06-05 23:22
Why do you want to do this? Encoding files on the fly seems out of scope for the SimpleHTTPRequestHandler class to me, but perhaps a more flexible API that the user could plug in would be beneficial.

See xmlrpc.server.SimpleXMLRPCRequestHandler.accept_encodings and related code for an existing implementation.

Did you consider using Transfer-Encoding instead of Content-Encoding?

What do you propose to do with Content-Length? What would happen to code that uses persistent HTTP connections?

There are a few bugs open about this in the client: Issue 1508475 discusses handling compression (especially via Content-Encoding), and Issue 4733 talks about text decoding, which would depend on decoding the Content-Encoding first.
msg295300 - (view) Author: Pierre Quentel (quentel) * Date: 2017-06-06 20:20
I propose this as a minor improvement to the built-in server, like the support of browser cache that will be included in Python 3.7 (issue #29654, PR #298). I understand that the server is not supposed to be full-featured, but HTTP compression is widespread, reduces network load and is easy to implement (the code will be very similar to SimpleXMLRPCRequestHandler).

Content-Encoding is used because it is the simplest to implement. Chunked transfer for large amounts of data seems to me to be out of scope for the built-in server.

Content-Length is set to the length of the compressed data. I don't understand your question about persistent connections: the proposal covers a single request/response sequence; it doesn't depend on whether the underlying TCP connection is reused or not.

From what I understand, issue #1508475 refers to the HTTP client, not the server, and #4733 refers to another meaning of encoding (conversion from characters to bytes with a charset such as UTF-8), not to HTTP compression, which unfortunately also uses "encoding" in its header names.
msg295562 - (view) Author: Glenn Linderman (v+python) * Date: 2017-06-09 19:01
I don't understand fully what you are planning here: to pre-compress the files, or to compress on the fly as mentioned by another commenter?

I've implemented, in a CGI behind http.server, both .gz and .br (gzip and brotli) compression, following these rules:

User requests file xyz

If xyz doesn't exist, then look for xyz.gz or xyz.br. If one of them exists, serve it. But if the browser doesn't support gzip or br (as appropriate), then decompress on the fly; otherwise set the appropriate Content-Encoding and send the compressed file.

This has worked out well. Of course, .br is only supported for https: transfers.  Most browsers support it now, except Apple.
msg295635 - (view) Author: Pierre Quentel (quentel) * Date: 2017-06-10 13:18
The compression is done on the fly: if the compression criteria are satisfied, the original content is gzipped, either in memory or in a temporary file on disk, depending on the file size.

The gzipped content is not cached, but since the server now supports browser caching, a 304 response will be sent on subsequent requests for the same file.
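The memory-or-disk choice can be sketched as follows (an illustration only: the function name and the 1 MiB threshold are made up, and the actual change lives inside SimpleHTTPRequestHandler.send_head()):

```python
import gzip
import io
import tempfile

# Illustrative cut-off between in-memory and on-disk compression;
# not the value used in the PR.
SIZE_THRESHOLD = 1 << 20  # 1 MiB

def gzip_to_fileobj(src, size):
    """Gzip the readable binary file `src` into memory or a temp file."""
    dest = io.BytesIO() if size <= SIZE_THRESHOLD else tempfile.TemporaryFile()
    with gzip.GzipFile(fileobj=dest, mode="wb") as gz:
        while True:
            chunk = src.read(64 * 1024)
            if not chunk:
                break
            gz.write(chunk)
    length = dest.tell()   # compressed size, for the Content-Length header
    dest.seek(0)
    return dest, length
```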
msg296247 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-06-18 00:58
I think neither Pierre’s nor Glenn’s implementations should be added to SimpleHTTPRequestHandler. In fact I think most forms of content negotiation are only appropriate for a higher-level server. It seems too far removed from the intention of the class, “directly mapping the directory structure to HTTP requests”.

Another concern with Pierre’s proposal is the delay and memory or disk usage that would be incurred for a large file (e.g. text/plain log file), especially with HEAD requests. I have Linux computers set up with /tmp just held in memory, no disk file system nor swap. It would be surprising that a HTTP request had to copy the entire file into memory before sending it.

It may be reasonable to serve the Content-Encoding field based on the stored file though. If the client requests file “xyz”, there should be no encoding, but if the request was explicitly for “xyz.gz”, the server could add Content-Encoding. But I suspect this won’t help Pierre.

Some other thoughts on the pull request:
* x-gzip is supposed to be an alias in HTTP 1.1 requests
* The response is HTTP 1.0, where x-gzip seems to be the canonical name
* In HTTP 1.1 requests, consider supporting Accept-Encoding: gzip;q=1
* Accept-Encoding: gzip;q=0
* Accept-Encoding: *
* Accept-Encoding: GZIP (case insensitivity)
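The cases listed above could be handled by a small parser along these lines (a simplified sketch, not the PR's code; a complete implementation would follow RFC 7231 section 5.3.4 more closely):

```python
def gzip_accepted(accept_encoding):
    """Decide whether gzip may be used, given an Accept-Encoding value.

    Covers q-values, the '*' wildcard, case insensitivity and the
    x-gzip alias.
    """
    wildcard_ok = None
    for item in accept_encoding.split(","):
        parts = [p.strip() for p in item.lower().split(";")]
        coding, q = parts[0], 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if coding in ("gzip", "x-gzip"):
            return q > 0            # an explicit entry settles it
        if coding == "*":
            wildcard_ok = q > 0
    return bool(wildcard_ok)
```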
msg296387 - (view) Author: Glenn Linderman (v+python) * Date: 2017-06-19 23:14
@martin sez:
It may be reasonable to serve the Content-Encoding field based on the stored file though. If the client requests file “xyz”, there should be no encoding, but if the request was explicitly for “xyz.gz”, the server could add Content-Encoding. But I suspect this won’t help Pierre.

I think this suggestion violates the mapping between file name and expected data type: if the request is for "xyz.gz" then the requester wants the compressed form of the file "xyz", but adding the Content-Encoding header would cause (most known) browsers to decode the file in the network stack, and the resulting file displayed or saved would be the uncompressed form, with the compressed name.

While my implementation of the compression technique I outlined is, in fact, in a higher-level server that runs on top of either http.server or Apache, my feeling when implementing it at that level was that this is the sort of thing the low-level server should be capable of, via configuration settings.

I note that for Apache, both mod_deflate and mod_brotli support either on-the-fly or pre-compressed data files, so it would appear that the authors of those modules agree with my perception that this should be a low-level server configuration thing.

Your example of a /tmp server, Martin, serves to point out the benefits of having pre-compressed files... much less storage is required.  I haven't looked at the pull request in detail: I'm not particularly interested in on-the-fly compression, but I wasn't certain until Pierre responded exactly what he was proposing.
msg296417 - (view) Author: Pierre Quentel (quentel) * Date: 2017-06-20 12:36
Thanks for the comments. I agree with some of them, and have improved the PR accordingly, but I don't agree with the opinion that HTTP compression is beyond the scope of http.server: like browser caching (which also involves a negotiation between client and server), it is a basic, standardized feature of HTTP servers; it doesn't change the intention of "mapping requests to the directory structure", it just changes the way some files are served, improving transfer speed and network load, which is especially useful on mobile networks.

The implementation makes HTTP compression the default for the types defined in SimpleHTTPRequestHandler.compressed_types, but it can be disabled if the system can't handle it, as in the case you mention, simply by setting compressed_types to the empty list. I have made this clearer in the documentation.

I understand the concern about HEAD requests, but they must return the same headers as GET, so the compression must be performed in order to determine the Content-Length header.

To address the case where the tmp area is limited, I have also added a try/except block that checks whether the temporary gzipped file can be created; if not, the file is returned uncompressed.
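The fallback described above amounts to something like this (a sketch; the helper name is made up, and the real code sits inside send_head()):

```python
import gzip
import tempfile

def try_gzip_tempfile(data):
    """Gzip `data` via a temporary file, falling back to the original
    bytes when the temp area is unavailable."""
    try:
        tmp = tempfile.TemporaryFile()
    except OSError:
        return data, False          # serve the file uncompressed
    with tmp:
        with gzip.GzipFile(fileobj=tmp, mode="wb") as gz:
            gz.write(data)
        tmp.seek(0)
        return tmp.read(), True
```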

I don't understand Martin's suggestion to use HTTP compression for .gz files: they are already compressed, so there would be no benefit in compressing them again. The same goes for audio or video files. Compression is only useful for data that is not already compressed, such as text files, scripts, CSS files, etc. Already-compressed files are sent unmodified and with their own content type.

I have also improved the PR for the alternative forms of Accept-Encoding ("gzip;q=0", "*", "GZIP", etc.).
History
Date                 User           Action  Args
2017-06-20 12:36:49  quentel        set     messages: + msg296417
2017-06-19 23:14:12  v+python       set     messages: + msg296387
2017-06-18 00:58:23  martin.panter  set     messages: + msg296247
2017-06-10 13:31:04  quentel        set     pull_requests: + pull_request2142
2017-06-10 13:18:04  quentel        set     messages: + msg295635
2017-06-09 19:01:09  v+python       set     nosy: + v+python; messages: + msg295562
2017-06-06 20:20:52  quentel        set     messages: + msg295300
2017-06-05 23:22:01  martin.panter  set     nosy: + martin.panter; messages: + msg295218
2017-06-05 19:42:25  quentel        set     type: enhancement
2017-06-05 19:42:04  quentel        create