classification
Title: http.server should support HTTP compression (gzip)
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, martin.panter, quentel, r.david.murray, terry.reedy, v+python
Priority: normal Keywords:

Created on 2017-06-05 19:42 by quentel, last changed 2017-10-02 09:37 by haypo.

Pull Requests
URL Status Linked
PR 2078 open quentel, 2017-06-10 13:31
Messages (26)
msg295207 - (view) Author: Pierre Quentel (quentel) * Date: 2017-06-05 19:42
The server in http.server currently doesn't support HTTP compression.

I propose to implement it in the method send_head() of SimpleHTTPRequestHandler this way : for each GET request, if the request header "Accept-Encoding" is present and includes "gzip" among the possible compression schemes, and if the Content-Type determined by the file extension is in a list compressed_types, then the server sends the "Content-Encoding" response header to "gzip" and send_head() returns a file object with the gzipped value.

compressed_types is an attribute of the SimpleHTTPRequestHandler class and is set by default to ["text/plain", "text/html", "text/css", "text/xml", "text/javascript", "application/javascript", "application/json"].
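A minimal sketch of the decision described above, not the code from the pull request. The helper name `maybe_gzip` is hypothetical, the list is the default quoted in this message, and the Accept-Encoding check is deliberately naive (it ignores q-values and wildcards).

```python
import gzip

# Default list quoted from the proposal above.
COMPRESSED_TYPES = [
    "text/plain", "text/html", "text/css", "text/xml",
    "text/javascript", "application/javascript", "application/json",
]


def maybe_gzip(body, content_type, accept_encoding,
               compressed_types=COMPRESSED_TYPES):
    """Return (payload, headers) for a GET response.

    Hypothetical helper: the body is gzipped only when the client lists
    "gzip" in Accept-Encoding and the content type is in compressed_types.
    """
    accepted = [token.strip().lower()
                for token in (accept_encoding or "").split(",")]
    if "gzip" in accepted and content_type in compressed_types:
        payload = gzip.compress(body)
        return payload, {"Content-Encoding": "gzip",
                         "Content-Length": str(len(payload))}
    # Otherwise the body is sent unmodified.
    return body, {"Content-Length": str(len(body))}
```

Here Content-Length is the length of the compressed payload, which means the whole body has to be compressed before any header is sent.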

The implementation is very simple (a few lines of code).

I also propose to modify mimetypes to add the mapping of extension ".json" to "application/json".

I will make a Pull Request on the CPython Github site with these changes.
msg295218 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-06-05 23:22
Why do you want to do this? Encoding files on the fly seems out of scope for the SimpleHTTPRequestHandler class to me, but perhaps a more flexible API that could be plugged in by the user could be beneficial.

See xmlrpc.server.SimpleXMLRPCRequestHandler.accept_encodings and related code for an existing implementation.

Did you consider using Transfer-Encoding instead of Content-Encoding?

What do you propose to do with Content-Length? What would happen to code that uses persistent HTTP connections?

There are a few bugs open about this in the client: Issue 1508475 discusses handling compression (especially via Content-Encoding), and Issue 4733 talks about text decoding, which would depend on decoding the Content-Encoding first.
msg295300 - (view) Author: Pierre Quentel (quentel) * Date: 2017-06-06 20:20
I propose this as a minor improvement to the built-in server, like the support of browser cache that will be included in Python 3.7 (issue #29654, PR #298). I understand that the server is not supposed to be full-featured, but HTTP compression is widespread, reduces network load and is easy to implement (the code will be very similar to SimpleXMLRPCRequestHandler).

Content-Encoding is used because it's the simplest to implement. Chunked transfer for large amounts of data seems to me to be out of the scope of the built-in server.

Content-Length is set to the length of the compressed data. I don't understand your question about persistent connections : the proposal covers a single request / response sequence, it doesn't depend on the underlying TCP connection being reused or not.

From what I understand, issue #1508475 refers to the http client, not server ; and #4733 refers to another meaning of encoding (conversion from characters to bytes with a charset such as utf-8), not to HTTP compression, which unfortunately also uses "encoding" in headers names.
msg295562 - (view) Author: Glenn Linderman (v+python) * Date: 2017-06-09 19:01
I don't understand fully what you are planning here: to pre-compress the files, or to compress on the fly as mentioned by another commenter?

I've implemented, in a CGI behind http.server, both .gz and .br (gzip and brotli) compression, following these rules:

User requests file  xyz

If xyz doesn't exist, then look for xyz.gz or xyz.br.  If one of them exists, then serve it. But if the browser doesn't support gzip or br (as appropriate) then decompress on the fly, otherwise set the appropriate Content-Encoding, and send the compressed file.

This has worked out well. Of course, .br is only supported for https: transfers.  Most browsers support it now, except Apple.
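The lookup rules above could be sketched roughly as follows. This is an illustration only, not the CGI implementation mentioned in this message: `pick_variant`, `SUFFIXES` and the returned tuple shape are all hypothetical, and the actual decompression step (brotli would need a third-party module) is left to the caller.

```python
import os

# Stored-file suffix assumed for each Content-Encoding name, in order of preference.
SUFFIXES = {"br": ".br", "gzip": ".gz"}


def pick_variant(path, accepted):
    """Choose which stored file to serve for a request that maps to `path`.

    `accepted` is the set of encoding names taken from Accept-Encoding.
    Returns (file_path, stored_encoding, send_as_is), or None if nothing
    exists. When send_as_is is False the caller has to decompress the file
    before sending it and must not emit a Content-Encoding header.
    """
    if os.path.exists(path):
        return path, None, True
    stored = [(encoding, path + suffix)
              for encoding, suffix in SUFFIXES.items()
              if os.path.exists(path + suffix)]
    # Prefer a stored variant that the client can decode itself.
    for encoding, candidate in stored:
        if encoding in accepted:
            return candidate, encoding, True
    # Fall back to decompressing on the fly.
    if stored:
        encoding, candidate = stored[0]
        return candidate, encoding, False
    return None
```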
msg295635 - (view) Author: Pierre Quentel (quentel) * Date: 2017-06-10 13:18
The compression is done on the fly : if compression criteria are satisfied, the original content is gzipped, either in memory or on a temporary file on disk, depending on the file size.

The gzipped content is not cached, but since the server now supports browser cache, on the next requests for the same file a 304 response will be sent.
msg296247 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-06-18 00:58
I think neither Pierre’s nor Glenn’s implementations should be added to SimpleHTTPRequestHandler. In fact I think most forms of content negotiation are only appropriate for a higher-level server. It seems too far removed from the intention of the class, “directly mapping the directory structure to HTTP requests”.

Another concern with Pierre’s proposal is the delay and memory or disk usage that would be incurred for a large file (e.g. text/plain log file), especially with HEAD requests. I have Linux computers set up with /tmp just held in memory, no disk file system nor swap. It would be surprising that an HTTP request had to copy the entire file into memory before sending it.

It may be reasonable to serve the Content-Encoding field based on the stored file though. If the client requests file “xyz”, there should be no encoding, but if the request was explicitly for “xyz.gz”, the server could add Content-Encoding. But I suspect this won’t help Pierre.

Some other thoughts on the pull request:
* x-gzip is supposed to be an alias in HTTP 1.1 requests
* The response is HTTP 1.0, where x-gzip seems to be the canonical name
* In HTTP 1.1 requests, consider supporting Accept-Encoding: gzip;q=1
* Accept-Encoding: gzip;q=0
* Accept-Encoding: *
* Accept-Encoding: GZIP (case insensitivity)
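The header forms listed above can be handled by a small parser. The following is a sketch under assumptions, not the parser used in the pull request: the function names are hypothetical, and a malformed q-value is treated as "not acceptable", which is a choice made here rather than something the thread specifies.

```python
def parse_accept_encoding(header):
    """Map each lower-cased coding in an Accept-Encoding header to its q-value."""
    weights = {}
    for item in header.split(","):
        coding, _, params = item.strip().partition(";")
        coding = coding.strip().lower()
        if not coding:
            continue
        q = 1.0  # a coding without an explicit weight counts as q=1
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: malformed weight means "not acceptable"
        weights[coding] = q
    return weights


def gzip_acceptable(header):
    """True if the header permits gzip, its alias x-gzip, or any coding via '*'."""
    weights = parse_accept_encoding(header)
    explicit = [weights[name] for name in ("gzip", "x-gzip") if name in weights]
    if explicit:
        return max(explicit) > 0
    return weights.get("*", 0.0) > 0
```

An explicit `gzip;q=0` wins over a wildcard, so `gzip;q=0, *` is rejected.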
msg296387 - (view) Author: Glenn Linderman (v+python) * Date: 2017-06-19 23:14
@martin sez:
It may be reasonable to serve the Content-Encoding field based on the stored file though. If the client requests file “xyz”, there should be no encoding, but if the request was explicitly for “xyz.gz”, the server could add Content-Encoding. But I suspect this won’t help Pierre.

I think this suggestion violates the mapping between file name and expected data type: if the request is for "xyz.gz" then the requester wants the compressed form of the file "xyz", but adding the Content-Encoding header would cause (most known) browsers to decode the file in the network stack, and the resulting file displayed or saved would be the uncompressed form, with the compressed name.

While my implementation of the compression technique I outlined is, in fact, in a higher-level server which runs on top of either http.server or Apache, my perception of having to implement it at that level was that this is the sort of thing that the low-level server should be capable of, via configuration settings.

I note that for Apache, both mod_deflate and mod_brotli support either on-the-fly or pre-compressed data files, so it would appear that the authors of those modules agree with my perception that this should be a low-level server configuration thing.

Your example of a /tmp server, Martin, serves to point out the benefits of having pre-compressed files... much less storage is required.  I haven't looked at the pull request in detail: I'm not particularly interested in on-the-fly compression, but I wasn't certain until Pierre responded exactly what he was proposing.
msg296417 - (view) Author: Pierre Quentel (quentel) * Date: 2017-06-20 12:36
Thanks for the comments. I agree with some of them, and have improved the PR accordingly, but I don't agree with the opinion that HTTP compression is beyond the scope of http.server : like browser cache (which also implies a negotiation between client and server) it's a basic, standardized feature of HTTP servers ; it doesn't change the intention of "mapping requests to the directory structure", it just changes the way some files are served, improving transfer speed and network load, which is especially useful on mobile networks.

The implementation makes HTTP compression the default for the types defined in SimpleHTTPRequestHandler.compressed_types, but it can be disabled if the system can't handle it, like in the case you mention : just by setting compressed_types to the empty list. I have made it more clear in the documentation.

I understand the concern for HEAD requests, but they must return the same headers as GET, so the compression must be done to determine the content-length header.

To address the case when the tmp zone is limited, I have also added a try/except block that checks if the temporary gzipped file can be created ; if not, the file is returned uncompressed.

I don't understand Martin's suggestion to use HTTP compression for .gz files : they are already compressed, there wouldn't be any benefit to compress them again. It's the same for audio or video files. Compression is only useful for uncompressed data such as text files, scripts, css files, etc. that are not already compressed. All those files are sent unmodified and with their own content type.

I have also improved the PR for the alternative forms of Accept-Encoding ("gzip;q=0", "*", "GZIP", etc.).
msg296799 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-06-25 04:07
For existing “.gz” files, I wasn’t suggesting to compress them a second time, just for the server to report that they are already compressed, like how it reports the Content-Type value based on the file name. Including Content-Encoding would help browsers display pages like <http://www.yoyodesign.org/outils/ngzip/index.en.html.gz>.
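For illustration, the standard `mimetypes` module already separates the content type from the encoding implied by a file name, so deriving both headers from the name is a few lines. The helper `describe` is hypothetical; whether http.server should act on the encoding this way is exactly what is being debated in this thread.

```python
import mimetypes


def describe(path):
    """Return the headers a static server could derive from the file name alone."""
    # guess_type() returns a (type, encoding) pair, e.g. for "page.html.gz".
    content_type, encoding = mimetypes.guess_type(path)
    headers = {"Content-Type": content_type or "application/octet-stream"}
    if encoding:
        headers["Content-Encoding"] = encoding
    return headers
```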
msg296839 - (view) Author: Glenn Linderman (v+python) * Date: 2017-06-25 23:39
Martin, I understood what you meant, but sadly, doing that leads to confusion. Follow your link, it displays fine, and then save the file. At least in Firefox, the default name to save as is "nGzip — A File Compressor.html". This looks appropriate, but the saved file is actually the compressed .gz form, so attempting to display it later, from the default name, displays the compressed gibberish, because the name does not reflect the encoding. Perhaps this should be considered a Firefox bug? Chrome saves the uncompressed version with the same default name. I can't actually figure out how to save the file from Edge, so don't know what it might do.

I'm surprised that Firefox, since it saves the compressed version, didn't offer the name "index.en.html.gz", and that Chrome, for the uncompressed version, didn't offer "index.en.html". Deriving the name from the title feels weird, but maybe only because I create web pages, and know what the real file names are. But this paragraph, other than the lack of ".gz" for Firefox naming, is veery off-topic.

The point I'm trying to make, though, is that the URIs shouldn't contain file extensions that include the compression, because that is confusing. The compression should be an internally negotiated detail between the browser and the web server, and the URI should reflect the content to be displayed, not the form in which it was encoded for transfer (or storage).  When .gz or .br is included in the URI, I would expect the browser to offer to save it to disk as a binary, compressed file, just like .zip. The variant behavior of Firefox and Chrome makes me wonder if there is even a standard that applies in this area... if there is, and if either one of them is following it, it is certainly not what I would expect.
msg296840 - (view) Author: Glenn Linderman (v+python) * Date: 2017-06-26 00:19
"veery" should be "veering" in above comment, sorry.
msg298550 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-07-17 18:03
I do not have a strong opinion on this issue, but I share Martin's doubts that it should be added.  It certainly should not be on by default if it is added.

We should get input from other core devs.
msg298639 - (view) Author: Pierre Quentel (quentel) * Date: 2017-07-19 07:00
Is Python-ideas the appropriate place to get input from other core devs ?
msg298672 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-07-19 12:57
Getting input from python ideas is a great idea :)
msg298744 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-07-20 19:08
I saw the python-ideas post.  I have no experience with http.server, but I just read the doc, and know something of our general enhancement policies.  I hope these comments help.

There are two questions: Should compression support be added to http.server?  If so, how?

To me, the purpose of the module is to provide a server and request handlers for toy, experimental, and basic production use, whether on a private or the public net.  Whatever the case was when the module was written, compression strikes me as a basic http feature now.

I agree with Martin that compression does not fit with the current definition of SimpleHTTPRequestHandler.  I suggest instead a subclass thereof.  CompressionHTTPRequestHandler?  ZippyHTTPRequestHandler?  Then add -zip to command line options.

If cgi + compression is relevant, a CompressionMixin might be possible, but I notice that there has been no suggestion so far that the combination is needed.

I suspect one motivation for adding compression to Simple... is to make it default.  I understand the desire to give users something for 'free', but changing default behavior usually breaks something somewhere and is therefore contrary to our general policy, and I definitely would not break it for this.
msg298783 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-21 08:43
I have used http.server a lot in the past to transfer files between two computers. I like transparent compression of the files. The implementation doesn't seem so complex, gzip is now standard and client HTTP headers are used to decide if gzip is wanted or not.

But I have comments on the current implementation.

My main question is about Content-Length. Can we skip it, so that we don't have to compress the whole file just to get its size? We could instead use gzip as a "stream", compressing while we write the compressed bytes to the socket.
msg298870 - (view) Author: Pierre Quentel (quentel) * Date: 2017-07-22 20:48
Thank you Terry and Victor for your comments. I understand that you agree on adding HTTP compression to http.server, but don't want it to be enabled by default.

@terry.reedy

With the implementation proposed in the Pull Request, to disable compression by default, all it takes is to set the attribute SimpleHTTPRequestHandler.compressed_types to the empty list.

Users who want to enable compression for some types would create subclasses of Simple... that set compressed_types. For instance :

import http.server

class CompressedHandler(http.server.SimpleHTTPRequestHandler):

    compressed_types = ["text/html", "text/plain"]

http.server.test(HandlerClass=CompressedHandler)


Would that be ok ?

For a command line argument --gzip, compressed_types could be set to a list of commonly compressed types - I don't think we want to pass a list of types on the command line.

For CGI scripts, I may be missing something but for me it is up to the script to determine the response headers and content, I don't think a mixin class in http.server could help.

@haypo

I fully agree on your comment about content-length, the implementation in the PR is not optimal.

The alternative would be to send the answer using chunked transfer encoding, which is currently not used in http.server, and would also be relevant for other use cases than gzipped files. It shouldn't be difficult to add, and I volunteer to propose an implementation, but it's yet another feature to add to the server. Do you think it's relevant ?
msg298874 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-07-22 21:36
I share RDM's sentiment and I'm not keen to add more complexity.

The http.server module in Python's stdlib is a useful example implementation of a simple HTTP server. I'd rather keep it simple and see a full featured HTTP server on PyPI. These days we don't have to include all batteries in Python's standard library. PyPI and pip made it super easy to install additional packages.

I might be convinced to add transparent gzip compression if it is done properly. A temporary file, compress-then-send, and whole content handling (content-length) are all wrong. Your current implementation will result in at least one CVE and a bunch of mails to the Python security mailing list. Transparent compression must use chunked encoding with a small in-memory buffer instead of a file and zlib.compressobj() instead of gzip.GzipFile.
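A rough sketch of the streaming approach suggested here, assuming a readable binary file object. The generator name `gzip_stream` and the 64 KiB buffer size are illustrative choices, not values taken from the pull request.

```python
import zlib


def gzip_stream(fileobj, bufsize=64 * 1024):
    """Yield gzip-framed chunks while reading `fileobj` in small blocks."""
    # wbits=31 (16 + 15) asks zlib to emit a gzip header and trailer.
    compressor = zlib.compressobj(wbits=31)
    while True:
        block = fileobj.read(bufsize)
        if not block:
            break
        data = compressor.compress(block)
        if data:  # the compressor may buffer input and return b""
            yield data
    # flush() returns the remaining data followed by the gzip trailer.
    yield compressor.flush()
```

Only one input block and the compressor's internal state are held in memory at any time, so no temporary file is needed; the price is that the total compressed length is not known when the headers are sent.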
msg299044 - (view) Author: Pierre Quentel (quentel) * Date: 2017-07-25 08:00
In the latest version of the Pull Request (https://github.com/python/cpython/pull/2078/commits/6466c93555bec521860c57e832b691fe7f0c6c20) :
- compression is disabled by default (compressed_types is set to [])
- as suggested by Chris Barker in the discussion on python-ideas, I added a list of commonly compressed types that can be used as the value of compressed_types in subclasses of SimpleHTTPRequestHandler. For want of official sources I took the list from https://github.com/h5bp/server-configs-apache.
- a command-line option --gzip has been introduced ; when it is set, the list of commonly compressed types is used
- the implementation of compression for "big files" has been changed : as suggested by Victor and Christian, no more temporary file, compressed data is now sent as chunks using Chunked Transfer Encoding. The file object returned by send_head() is a generator that produces the chunks of compressed data
msg299088 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-07-25 14:16
I think chunked encoding is only meant to be used for HTTP 1.1. For HTTP 1.0, you have to either send Content-Length, or shut down the connection after sending the body. See also Issue 21224 about improving HTTP 1.1 support.

Maybe you should add a “Vary: accept-encoding” field (even if gzip is not acceptable to the client).

I wonder if it is possible to make some of the code more generic? E.g. rather than being coupled to SimpleHTTPRequestHandler, perhaps the chunk encoder, request parsing, etc could also be usable by custom servers. We already have one chunk encoder in “http.client” and an Accept-Encoding parser in “xmlrpc.server”.

FWIW I think using GzipFile should be safe if done right. You would give it a custom writer class that accepted gzip-encoded chunks, added HTTP chunk encoding (for HTTP 1.1), and sent them to the client. IMO this is a more flexible way of doing a chunk encoder.
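One way to read the suggestion above, as a sketch only: a write-only object that frames whatever GzipFile writes as HTTP/1.1 chunks. `ChunkedWriter` and `send_gzip_chunked` are hypothetical names, and `wfile` stands for any binary stream such as the handler's output file.

```python
import gzip
import shutil


class ChunkedWriter:
    """Minimal write-only file object that frames each write as an HTTP/1.1 chunk."""

    def __init__(self, wfile):
        self.wfile = wfile

    def write(self, data):
        if data:  # a zero-length chunk would terminate the body early
            self.wfile.write(b"%X\r\n%s\r\n" % (len(data), data))
        return len(data)

    def flush(self):
        self.wfile.flush()

    def finish(self):
        # Last chunk, with no trailer fields.
        self.wfile.write(b"0\r\n\r\n")


def send_gzip_chunked(source, wfile):
    """Gzip `source` and write it to `wfile` using chunked transfer coding."""
    writer = ChunkedWriter(wfile)
    with gzip.GzipFile(fileobj=writer, mode="wb") as gz:
        shutil.copyfileobj(source, gz)
    writer.finish()
```

GzipFile issues several very small writes for its header and trailer, so this produces some tiny chunks; that is valid, and a buffering layer could be added if it matters.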
msg299217 - (view) Author: Pierre Quentel (quentel) * Date: 2017-07-26 10:08
@martin.panter

For HTTP/1.0, since chunked transfer is not supported, and storage in a temporary file is also not an option, I see 2 possible solutions :
- give up compressing big files - it would be a pity, compression is actually made for them...
- compress the file 2 times : a first time just to compute the content length, without storing or sending anything, and a second time to send the gzipped data after all headers have been sent

If there is a 3rd solution I'd be glad to know ; otherwise I prefer the second one, in spite of the waste of CPU.
msg299366 - (view) Author: Pierre Quentel (quentel) * Date: 2017-07-28 06:01
@martin.panter

Please forget my previous message. There is a 3rd solution, and you gave it : no Content-Length and close the connection when all (compressed) data has been sent.
msg299664 - (view) Author: Pierre Quentel (quentel) * Date: 2017-08-02 17:02
In the latest version of the PR, following Martin's comments :
- apply Chunked Transfer Encoding for HTTP/1.1 only, change implementation of compression for previous protocols (send gzipped data without Content-Length)
- use http.cookiejar to parse the Accept-Encoding header
- fix a bug with chunk length (conversion to hex)
- support x-gzip besides gzip
- handle Python builds without zlib / gzip
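The protocol-dependent framing in the first item of this list can be summarised in a few lines. This is a simplified sketch with a hypothetical helper name, assuming the version string has the form http.server stores in `request_version` (for example "HTTP/1.1").

```python
def framing_headers(request_version):
    """Pick response framing for a compressed body whose length is not known."""
    if request_version == "HTTP/1.1":
        # Chunked transfer coding delimits the body; the connection can stay open.
        return {"Transfer-Encoding": "chunked"}
    # Older protocols: no Content-Length, the end of the body is signalled
    # by closing the connection after the last compressed byte.
    return {"Connection": "close"}
```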

Headers parsing is done in several places in the standard distribution. It should probably be done in a single module, but I think it would be better to open a new issue for that, as it would impact more modules than just http.server.

I couldn't find a simple way to reuse code from http.client to generate HTTP chunks (it's in HTTPConnection._send_output()), but I'm not sure it's worth it, the code to generate a chunk is a one-liner.
msg300285 - (view) Author: Pierre Quentel (quentel) * Date: 2017-08-15 08:00
On Python-ideas someone asked if other compressions could be supported besides gzip.

The latest version of the PR adds a mechanism for that : SimpleHTTPRequestHandler has a new attribute "compressions", a dictionary that maps compression encodings (eg "gzip") to a "compressed data generator". The generator takes a file object as argument, yields non-empty chunks of compressed data and ends by yielding b'' for compliance with Chunked Transfer Encoding protocol.

To support other compression algorithms, "compressions" can be extended with another key (eg "brotli") mapped to the appropriate generator. A test has been added with the non-standard "bzip2" encoding, using the bz2 module in the standard distribution.
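A sketch of what such a mapping might look like, following the contract described above (non-empty chunks, then a final b''). The helper `_stream` and the generator names are hypothetical; only the attribute name "compressions" and the "gzip" and "bzip2" keys come from this message.

```python
import bz2
import zlib


def _stream(compressor, fileobj, bufsize=64 * 1024):
    """Yield non-empty compressed chunks, then b'' as the end marker."""
    for block in iter(lambda: fileobj.read(bufsize), b""):
        data = compressor.compress(block)
        if data:
            yield data
    tail = compressor.flush()
    if tail:
        yield tail
    yield b""


def gzip_generator(fileobj):
    # wbits=31 selects gzip framing in zlib.
    return _stream(zlib.compressobj(wbits=31), fileobj)


def bzip2_generator(fileobj):
    return _stream(bz2.BZ2Compressor(), fileobj)


# Encoding name (as used in Content-Encoding) -> compressed data generator.
compressions = {"gzip": gzip_generator, "bzip2": bzip2_generator}
```

Both compressor objects expose the same `compress()` / `flush()` pair, which is why one helper can serve every entry.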

I also added support for "deflate" by default (it's very close to "gzip").
msg303128 - (view) Author: Stefan Behnel (scoder) * Date: 2017-09-27 11:28
FWIW, both the feature and the PR look ok to me. Code formatting is a little funny at times, but the implementation looks good.
msg303379 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-09-30 00:31
Regarding the compressed data generator, it would be better if there were no restrictions on the generator yielding empty chunks. This would match how the upload “body” parameter for HTTPConnection.request can be an iterator without worrying about empty chunks. IMO a chunked encoder should skip empty chunks and add the trailer itself, rather than exposing these special requirements to higher levels.
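A chunked encoder along the lines suggested here could look like the sketch below. The name `encode_chunked` is hypothetical; the point is that empty chunks are dropped and the terminating chunk is appended by the encoder, so the generator feeding it needs no special rules.

```python
def encode_chunked(chunks):
    """Frame an iterable of bytes objects as an HTTP/1.1 chunked body."""
    for chunk in chunks:
        if chunk:  # silently skip empty chunks
            yield b"%X\r\n%s\r\n" % (len(chunk), chunk)
    # The terminating zero-length chunk is added here, not by the producer.
    yield b"0\r\n\r\n"
```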
History
Date                 User              Action  Args
2017-10-02 09:37:51  haypo             set     nosy: - haypo
2017-09-30 00:31:20  martin.panter     set     messages: + msg303379
2017-09-27 11:29:10  scoder            set     nosy: - scoder
2017-09-27 11:28:49  scoder            set     nosy: + scoder; messages: + msg303128
2017-08-15 08:00:02  quentel           set     messages: + msg300285
2017-08-02 17:02:28  quentel           set     messages: + msg299664
2017-07-28 06:01:09  quentel           set     messages: + msg299366
2017-07-26 10:08:34  quentel           set     messages: + msg299217
2017-07-25 14:16:51  martin.panter     set     messages: + msg299088
2017-07-25 08:00:41  quentel           set     messages: + msg299044
2017-07-22 21:36:11  christian.heimes  set     nosy: + christian.heimes; messages: + msg298874
2017-07-22 20:48:02  quentel           set     messages: + msg298870
2017-07-21 08:43:38  haypo             set     nosy: + haypo; messages: + msg298783
2017-07-20 19:08:29  terry.reedy       set     nosy: + terry.reedy; messages: + msg298744
2017-07-19 12:57:08  r.david.murray    set     messages: + msg298672
2017-07-19 07:00:21  quentel           set     messages: + msg298639
2017-07-17 18:03:58  r.david.murray    set     nosy: + r.david.murray; messages: + msg298550
2017-06-26 00:19:27  v+python          set     messages: + msg296840
2017-06-25 23:39:14  v+python          set     messages: + msg296839
2017-06-25 04:07:33  martin.panter     set     messages: + msg296799
2017-06-20 12:36:49  quentel           set     messages: + msg296417
2017-06-19 23:14:12  v+python          set     messages: + msg296387
2017-06-18 00:58:23  martin.panter     set     messages: + msg296247
2017-06-10 13:31:04  quentel           set     pull_requests: + pull_request2142
2017-06-10 13:18:04  quentel           set     messages: + msg295635
2017-06-09 19:01:09  v+python          set     nosy: + v+python; messages: + msg295562
2017-06-06 20:20:52  quentel           set     messages: + msg295300
2017-06-05 23:22:01  martin.panter     set     nosy: + martin.panter; messages: + msg295218
2017-06-05 19:42:25  quentel           set     type: enhancement
2017-06-05 19:42:04  quentel           create