This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Add write buffering to gzip
Type: enhancement
Stage: needs patch
Components: Extension Modules
Versions: Python 3.1, Python 2.7

process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: ajaksu2, ebfe, neologix, pitrou, rhettinger
Priority: normal
Keywords:

Created on 2006-06-05 16:40 by rhettinger, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name   Uploaded                      Description
gztest.py   rhettinger, 2006-06-05 16:40  Script to generate comparative timings
Messages (6)
msg54815 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2006-06-05 16:40
A series of write() calls is dog slow compared to building up a pool of data and then writing it in larger batches.

The attached script demonstrates the speed-up potential. It compares a series of GzipFile.write() calls to an alternate approach using cStringIO.write() calls followed by a single GzipFile.write(sio.getvalue()). On my box, there is a three-fold speed-up.
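
For reference, a minimal Python 2 sketch of the comparison being described; gztest.py is attached but not reproduced here, so the file names and data below are illustrative:

import gzip
from cStringIO import StringIO

snips = ["spam and eggs\n"] * 10000

# Many small writes: each call goes through the compressor machinery.
f = gzip.open("direct.gz", "wb")
for snip in snips:
    f.write(snip)
f.close()

# Batched: accumulate the fragments in a cStringIO buffer first, then
# hand the compressor one large string.
f = gzip.open("batched.gz", "wb")
sio = StringIO()
for snip in snips:
    sio.write(snip)
f.write(sio.getvalue())
f.close()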
msg83968 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-22 11:14
Although the script does not work as-is (it is missing an import of "string" and has a typo between "frags" and "wfrags"), I can confirm the 3x ratio.
msg83984 - Author: Lukas Lueg (ebfe) Date: 2009-03-22 21:21
This is true for all objects whose input can be concatenated.

For example, with hashlib:

import hashlib

data = [b"foobar"] * 100000

# Feed the data piece by piece.
mdX = hashlib.sha1()
for d in data:
    mdX.update(d)

# Join first, then hash everything in a single update() call.
mdY = hashlib.sha1()
mdY.update(b"".join(data))

assert mdX.digest() == mdY.digest()

The second version is several times faster.
msg102233 - Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-04-03 10:20
In the test script, simply changing 

def emit(f, data=snips):
    for datum in data:
        f.write(datum)

to 

def gemit(f, data=snips):
    datas = ''.join(data)
    f.write(datas)

improves direct gzip performance from
[1.1799781322479248, 0.50524115562438965, 0.2713780403137207]
[1.183434009552002, 0.50997591018676758, 0.26801109313964844]
[1.173914909362793, 0.51325297355651855, 0.26233196258544922]

to

[0.43065404891967773, 0.50007486343383789, 0.26698708534240723]
[0.43662095069885254, 0.49983596801757812, 0.2686460018157959]
[0.43778109550476074, 0.50057196617126465, 0.2687230110168457]

which means you're better off letting the application handle buffering. Furthermore, the problem with gzip-level buffering is choosing a sensible default buffer size.

Time to close?
msg102249 - Author: Lukas Lueg (ebfe) Date: 2010-04-03 12:33
Agreed.
msg102253 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-04-03 12:45
Additionally, since issue7471 was fixed, you should be able to wrap a GzipFile in a Buffered{Reader,Writer} object for faster buffering.
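
A minimal sketch of that approach (the file name and data are illustrative, not from the issue):

import gzip
import io

# Wrap the GzipFile in an io.BufferedWriter so that many small writes
# are coalesced in memory before they reach the compressor.
raw = gzip.GzipFile("out.gz", "wb")
f = io.BufferedWriter(raw, buffer_size=64 * 1024)
for chunk in [b"spam"] * 100000:
    f.write(chunk)  # usually just appends to the in-memory buffer
f.close()           # flushes the buffer and closes the GzipFile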
History
Date                 User        Action  Args
2022-04-11 14:56:17  admin       set     github: 43459
2010-04-03 12:45:17  pitrou      set     status: open -> closed; resolution: out of date; messages: + msg102253
2010-04-03 12:33:36  ebfe        set     messages: + msg102249
2010-04-03 10:20:09  neologix    set     nosy: + neologix; messages: + msg102233
2009-03-22 21:21:27  ebfe        set     nosy: + ebfe; messages: + msg83984
2009-03-22 11:14:32  pitrou      set     nosy: + pitrou; messages: + msg83968
2009-03-21 03:39:38  ajaksu2     set     nosy: + ajaksu2; versions: + Python 3.1, Python 2.7; stage: needs patch
2006-06-05 16:40:29  rhettinger  create