classification
Title: update gzip usage examples in docs
Type: performance Stage: resolved
Components: Documentation Versions: Python 3.4, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: akuchling, docs@python, eric.araujo, maatt, methane, python-dev, wolma
Priority: normal Keywords: patch

Created on 2014-04-03 10:40 by wolma, last changed 2015-04-14 15:47 by akuchling. This issue is now closed.

Files
File name Uploaded Description Edit
gzip_example_usage_patch.diff wolma, 2014-04-08 16:49 review
Messages (10)
msg215440 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-04-03 10:40
The current documentation of the gzip module should have its section "12.2.1. Examples of usage" updated to reflect the changes made to the module in Python 3.2 (https://docs.python.org/3.2/whatsnew/3.2.html#gzip-and-zipfile).

Currently, the recipe given for gz-compressing a file is:

import gzip
with open('/home/joe/file.txt', 'rb') as f_in:
    with gzip.open('/home/joe/file.txt.gz', 'wb') as f_out:
        f_out.writelines(f_in)

which is clearly sub-optimal because it is line-based.

An equally simple, but more efficient recipe would be:

import gzip

chunk_size = 1024
with open('/home/joe/file.txt', 'rb') as f_in:
    with gzip.open('/home/joe/file.txt.gz', 'wb') as f_out:
        while True:
            # read fixed-size blocks instead of iterating line by line
            c = f_in.read(chunk_size)
            if not c:
                break
            f_out.write(c)

Comparing the two examples I find a >= 2x performance gain (both in terms of CPU time and wall time).
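
For anyone who wants to reproduce this comparison, a minimal timing sketch could look like the following. The function names, the time_recipe helper and the generated test input are illustrative assumptions and not part of the original report; the measured ratio will depend on the machine and on how long the lines in the input are.

```python
import gzip
import os
import tempfile
import time


def compress_by_lines(src, dst):
    """Line-based recipe: iterating f_in splits the input on newlines."""
    with open(src, 'rb') as f_in:
        with gzip.open(dst, 'wb') as f_out:
            f_out.writelines(f_in)


def compress_by_chunks(src, dst, chunk_size=1024):
    """Chunk-based recipe: fixed-size reads, no search for line terminators."""
    with open(src, 'rb') as f_in:
        with gzip.open(dst, 'wb') as f_out:
            while True:
                chunk = f_in.read(chunk_size)
                if not chunk:
                    break
                f_out.write(chunk)


def time_recipe(recipe, src, dst):
    """Return the wall-clock seconds taken by one call of recipe(src, dst)."""
    start = time.perf_counter()
    recipe(src, dst)
    return time.perf_counter() - start


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'file.txt')
        # Hypothetical test input: many short lines of text.
        with open(src, 'wb') as f:
            f.write(b'some short line of text\n' * 200_000)
        for recipe in (compress_by_lines, compress_by_chunks):
            dst = os.path.join(tmp, recipe.__name__ + '.gz')
            seconds = time_recipe(recipe, src, dst)
            print(f'{recipe.__name__}: {seconds:.3f} s')
```

Both functions write a gzip file with the same decompressed content; only the way the input is read differs.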

In the inverse scenario of file *de*-compression (which is not part of the docs though), the performance increase of substituting:

with gzip.open('/home/joe/file.txt.gz', 'rb') as f_in:
    with open('/home/joe/file.txt', 'wb') as f_out:
        f_out.writelines(f_in)

with:

with gzip.open('/home/joe/file.txt.gz', 'rb') as f_in:
    with open('/home/joe/file.txt', 'wb') as f_out:
        while True:
            c = f_in.read(chunk_size)
            if not c:
                break
            f_out.write(c)

is even higher (4-5x speed-ups).

In the de-compression case, another >= 2x speed-up can be achieved by avoiding the gzip module completely and going through a zlib.decompressobj instead. Of course, this is a bit more complicated and should be documented in the zlib docs rather than the gzip docs (if you're interested, I could provide my code for it).
When zlib is used directly, compression/decompression speed becomes comparable to that of Linux gzip/gunzip.
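
The zlib.decompressobj route mentioned above could be sketched roughly as follows. This is not the reporter's code (which was not posted in this thread); the function name gunzip_with_zlib and the 16 KiB chunk size are assumptions, and the sketch only handles gzip files with a single member.

```python
import zlib


def gunzip_with_zlib(src, dst, chunk_size=16 * 1024):
    """Decompress the gzip file src into dst using zlib directly."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    with open(src, 'rb') as f_in:
        with open(dst, 'wb') as f_out:
            while True:
                chunk = f_in.read(chunk_size)
                if not chunk:
                    break
                f_out.write(decompressor.decompress(chunk))
            # Write out anything still buffered inside the decompressor.
            f_out.write(decompressor.flush())
```

Data following the first gzip member ends up in decompressor.unused_data and is ignored here, so concatenated gzip files would need extra handling.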
msg215442 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2014-04-03 11:57
Maybe shutil.copyfileobj() would be a good fit.

import gzip
import shutil

with open(src, 'rb') as f_in:
    with gzip.open(dst, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
msg215443 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-04-03 12:44
>> with open(src, 'rb') as f_in:
>>     with gzip.open(dst, 'wb') as f_out:
>>        shutil.copyfileobj(f_in, f_out)

+1 !!
Exactly as fast as my suggestion (for both compression and de-compression), but a lot clearer!
Hadn't thought of it.
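
For the de-compression direction, the same shutil.copyfileobj pattern might look like the sketch below. The helper name gunzip_file is hypothetical, and src and dst are placeholder paths, as in the quoted snippet.

```python
import gzip
import shutil


def gunzip_file(src, dst):
    """Decompress the gzip file src into the plain file dst."""
    with gzip.open(src, 'rb') as f_in:
        with open(dst, 'wb') as f_out:
            # copyfileobj reads fixed-size blocks, not lines.
            shutil.copyfileobj(f_in, f_out)
```

Only the roles of open() and gzip.open() are swapped compared to the compression recipe; the copy loop itself is unchanged.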
msg215444 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-04-03 12:50
The same speed is not surprising, though, as shutil.copyfileobj is implemented like this:

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)

which is essentially what I was proposing :)
msg215773 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-04-08 16:49
ok, I've prepared the patch using the elegant shutil solution.
msg216110 - (view) Author: Matt Chaput (maatt) Date: 2014-04-14 16:54
The patch looks good to me.
msg216144 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2014-04-14 18:07
Isn’t there a buffering argument in open that can be used to avoid line buffering?
msg216280 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-04-15 07:43
Well, buffering is not the issue here. The problem is that the file iterator used in the current example is line-based, so whatever the buffer size, you're doing unnecessary inspection to find and split on line terminators.
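
The difference between the two ways of reading can be seen on an in-memory stream. This is only an illustration: the sample data and the chunk size of 8 are arbitrary assumptions, and io.BytesIO stands in for a file opened in binary mode.

```python
import io

data = b'first line\nsecond line\nthird'

# Iterating a binary file object yields one bytes object per line,
# which requires scanning the data for line terminators.
lines = list(io.BytesIO(data))

# read(n) returns fixed-size blocks regardless of where newlines fall.
stream = io.BytesIO(data)
chunks = list(iter(lambda: stream.read(8), b''))

print(lines)   # [b'first line\n', b'second line\n', b'third']
print(chunks)  # [b'first li', b'ne\nsecon', b'd line\nt', b'hird']
```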
msg240915 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-04-14 15:45
New changeset ae1528beae67 by Andrew Kuchling in branch 'default':
#21146: give a more efficient recipe in gzip docs
https://hg.python.org/cpython/rev/ae1528beae67
msg240916 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2015-04-14 15:47
Applied to trunk.  Wolfgang Maier: thanks for your patch!
History
Date User Action Args
2015-04-14 15:47:03 akuchling set nosy: + akuchling; messages: + msg240916
2015-04-14 15:46:40 akuchling set status: open -> closed; resolution: fixed; stage: resolved
2015-04-14 15:45:22 python-dev set nosy: + python-dev; messages: + msg240915
2014-04-15 07:43:34 wolma set messages: + msg216280
2014-04-14 18:07:19 eric.araujo set nosy: + eric.araujo; messages: + msg216144; versions: + Python 3.5, - Python 3.2, Python 3.3
2014-04-14 16:54:57 maatt set nosy: + maatt; messages: + msg216110
2014-04-08 16:49:08 wolma set files: + gzip_example_usage_patch.diff; keywords: + patch; messages: + msg215773
2014-04-03 12:50:40 wolma set messages: + msg215444
2014-04-03 12:44:39 wolma set messages: + msg215443
2014-04-03 11:57:06 methane set nosy: + methane; messages: + msg215442
2014-04-03 10:40:06 wolma create