Issue 21146: update gzip usage examples in docs


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/65345

classification

Title:	update gzip usage examples in docs
Type:	performance	Stage:	resolved
Components:	Documentation	Versions:	Python 3.4, Python 3.5

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	akuchling, docs@python, eric.araujo, maatt, methane, python-dev, wolma
Priority:	normal	Keywords:	patch

Created on 2014-04-03 10:40 by wolma, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
gzip_example_usage_patch.diff	wolma, 2014-04-08 16:49		review

Messages (10)
msg215440 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-04-03 10:40
The current documentation of the gzip module should have its section "12.2.1. Examples of usage" updated to reflect the changes made to the module in Python 3.2 (https://docs.python.org/3.2/whatsnew/3.2.html#gzip-and-zipfile).

Currently, the recipe given for gz-compressing a file is:

    import gzip
    with open('/home/joe/file.txt', 'rb') as f_in:
        with gzip.open('/home/joe/file.txt.gz', 'wb') as f_out:
            f_out.writelines(f_in)

which is clearly sub-optimal because it is line-based. An equally simple, but more efficient recipe would be:

    chunk_size = 1024
    with open('/home/joe/file.txt', 'rb') as f_in:
        with gzip.open('/home/joe/file.txt.gz', 'wb') as f_out:
            while True:
                c = f_in.read(chunk_size)
                if not c: break
                d = f_out.write(c)

Comparing the two examples I find a >= 2x performance gain (both in terms of CPU time and wall time).

In the inverse scenario of file de-compression (which is not part of the docs though), the performance increase of substituting:

    with gzip.open('/home/joe/file.txt.gz', 'rb') as f_in:
        with open('/home/joe/file.txt', 'wb') as f_out:
            f_out.writelines(f_in)

with:

    with gzip.open('/home/joe/file.txt.gz', 'rb') as f_in:
        with open('/home/joe/file.txt', 'wb') as f_out:
            while True:
                c = f_in.read(chunk_size)
                if not c: break
                d = f_out.write(c)

is even higher (4-5x speed-ups).

In the de-compression case, another >= 2x speed-up can be achieved by avoiding the gzip module completely and going through a zlib.decompressobj instead, but of course this is a bit more complicated and should be documented in the zlib docs rather than the gzip docs (if you're interested, I could provide my code for it though). Using the zlib library, compression/decompression speed gets comparable to linux gzip/gunzip.
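The zlib.decompressobj route mentioned in the message above is not spelled out in the thread. The following is only a rough sketch of what such an approach might look like, not the reporter's actual code: the function name `gunzip_with_zlib` and the 64 KiB chunk size are assumptions made for illustration, and the sketch handles only single-member gzip files.

```python
import zlib


def gunzip_with_zlib(src_path, dst_path, chunk_size=64 * 1024):
    """Decompress a single-member gzip file by feeding chunks to zlib.

    Hypothetical helper for illustration; not taken from the issue.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    # instead of a raw zlib stream.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    with open(src_path, 'rb') as f_in, open(dst_path, 'wb') as f_out:
        while True:
            chunk = f_in.read(chunk_size)
            if not chunk:
                break
            f_out.write(decomp.decompress(chunk))
        # Write out whatever the decompressor is still holding.
        f_out.write(decomp.flush())
```

Unlike gzip.open, this sketch does not handle files made of several concatenated gzip members, and it was not benchmarked here, so the speed-up quoted above should not be assumed to carry over to it.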
msg215442 - (view)	Author: Inada Naoki (methane) *	Date: 2014-04-03 11:57
Maybe, shutil.copyfileobj() is good.

    import gzip
    import shutil
    with open(src, 'rb') as f_in:
        with gzip.open(dst, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
msg215443 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-04-03 12:44
>> with open(src, 'rb') as f_in:
>>     with gzip.open(dst, 'wb') as f_out:
>>         shutil.copyfileobj(f_in, f_out)

+1 !! exactly as fast as my suggestion (with compression and de-compression), but a lot clearer ! Hadn't thought of it.
msg215444 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-04-03 12:50
same speed is not surprising though as shutil.copyfileobj is implemented like this:

    def copyfileobj(fsrc, fdst, length=16*1024):
        """copy data from file-like object fsrc to file-like object fdst"""
        while 1:
            buf = fsrc.read(length)
            if not buf:
                break
            fdst.write(buf)

which is essentially what I was proposing :)
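Since the thread notes that the same approach works for both compression and de-compression, here is a small sketch covering both directions. The helper names `gzip_file` and `gunzip_file` are hypothetical and exist only for this illustration; the calls themselves are the ones already discussed above.

```python
import gzip
import shutil


def gzip_file(src, dst):
    """Compress src into dst, copying in fixed-size chunks."""
    with open(src, 'rb') as f_in, gzip.open(dst, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)


def gunzip_file(src, dst):
    """Decompress src into dst, copying in fixed-size chunks."""
    with gzip.open(src, 'rb') as f_in, open(dst, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
```

The chunk size can be tuned through the optional third argument of shutil.copyfileobj if the default does not suit the workload.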
msg215773 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-04-08 16:49
ok, I've prepared the patch using the elegant shutil solution.
msg216110 - (view)	Author: Matt Chaput (maatt)	Date: 2014-04-14 16:54
The patch looks good to me.
msg216144 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2014-04-14 18:07
Isn’t there a buffering argument in open that can be used to avoid line buffering?
msg216280 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-04-15 07:43
Well, buffering is not the issue here. The file iterator used in the current example is line-based, so whatever the buffer size, you're doing unnecessary inspection to find and split on line terminators.
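To make the point about line-based iteration concrete, here is a small sketch using an in-memory stream. The sample data and the helper name `read_chunks` are made up for illustration. Iterating over a binary file object splits on line terminators regardless of buffering, whereas read(size) returns fixed-size pieces without looking at the content.

```python
import io

# 27 bytes of sample data containing 9 line terminators.
data = b"a\nbb\nccc\n" * 3


def read_chunks(f, size):
    """Return the pieces produced by repeated f.read(size) calls."""
    return list(iter(lambda: f.read(size), b""))


# Iterating the file object yields one piece per line: this is what
# writelines(f_in) consumes in the old recipe.
line_pieces = list(io.BytesIO(data))

# Reading fixed-size chunks ignores line terminators entirely.
chunk_pieces = read_chunks(io.BytesIO(data), 16)

print(len(line_pieces))   # 9
print(len(chunk_pieces))  # 2
```

Both approaches reproduce the same bytes when the pieces are joined; the difference is only in how many pieces are produced and how much scanning is needed to find their boundaries.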
msg240915 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-04-14 15:45
New changeset ae1528beae67 by Andrew Kuchling in branch 'default':
#21146: give a more efficient recipe in gzip docs
https://hg.python.org/cpython/rev/ae1528beae67
msg240916 - (view)	Author: A.M. Kuchling (akuchling) *	Date: 2015-04-14 15:47
Applied to trunk. Wolfgang Maier: thanks for your patch!

History
Date	User	Action	Args
2022-04-11 14:58:01	admin	set	github: 65345
2015-04-14 15:47:03	akuchling	set	nosy: + akuchling messages: + msg240916
2015-04-14 15:46:40	akuchling	set	status: open -> closed resolution: fixed stage: resolved
2015-04-14 15:45:22	python-dev	set	nosy: + python-dev messages: + msg240915
2014-04-15 07:43:34	wolma	set	messages: + msg216280
2014-04-14 18:07:19	eric.araujo	set	nosy: + eric.araujo messages: + msg216144 versions: + Python 3.5, - Python 3.2, Python 3.3
2014-04-14 16:54:57	maatt	set	nosy: + maatt messages: + msg216110
2014-04-08 16:49:08	wolma	set	files: + gzip_example_usage_patch.diff keywords: + patch messages: + msg215773
2014-04-03 12:50:40	wolma	set	messages: + msg215444
2014-04-03 12:44:39	wolma	set	messages: + msg215443
2014-04-03 11:57:06	methane	set	nosy: + methane messages: + msg215442
2014-04-03 10:40:06	wolma	create