Message 213883 - Python tracker

Author	skip.montanaro
Recipients	skip.montanaro
Date	2014-03-17.18:58:45
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1395082726.87.0.730756156096.issue20962@psf.upfronthosting.co.za>
In-reply-to

Content
I've had the opportunity to use the seek() method of the gzip.GzipFile class for the first time in the past few days. Wondering why my processing times seemed so slow, I took a look at the code for seek() and read(). The chunk size used for reading (1024 bytes) seems rather small. I created a simple subclass that overrode just seek() and read(), then defined a CHUNK_SIZE of 16 * 8192 bytes (the whole point of compressing files is that they are large, right? It seems like most of the time we will want to seek pretty far through the file).
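
A minimal sketch of the idea, for illustration only. It is not the attached class: it assumes Python 3 and uses a hypothetical helper, seek_forward(), that moves to an uncompressed offset by reading and discarding data in CHUNK_SIZE pieces instead of overriding seek() and read() in a subclass.

```python
import gzip
import io

# Chunk size proposed in the message: 16 * 8192 = 131072 bytes (128 KiB).
CHUNK_SIZE = 16 * 8192


def seek_forward(f, target, chunk_size=CHUNK_SIZE):
    """Move a readable gzip stream to the uncompressed offset `target`.

    Hypothetical helper: a gzip stream cannot jump ahead, so the data
    between the current position and `target` is read and thrown away.
    Larger chunks mean fewer read() calls for the same distance.
    """
    remaining = target - f.tell()
    if remaining < 0:
        # Going backwards means rewinding and decompressing from the start.
        f.seek(0)
        remaining = target
    while remaining > 0:
        data = f.read(min(chunk_size, remaining))
        if not data:  # reached end of file before the target offset
            break
        remaining -= len(data)
    return f.tell()


if __name__ == "__main__":
    payload = bytes(range(256)) * 4096  # 1 MiB of sample data
    with gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(payload))) as f:
        seek_forward(f, 300000)
        print(f.read(4) == payload[300000:300004])
```

Skipping 1 MB this way takes 8 read() calls at 128 KiB per chunk, versus roughly a thousand at 1024 bytes per chunk.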

Over a small subset of my inputs, run times were roughly halved, from about 54s to 26s. I ran using both gzip.GzipFile and my subclass several times, measuring the last four runs (two using the stdlib implementation, two using my subclass). I measured the total time of the run, the time to process each input record, and the time to execute just the seek() call for each record. The bulk of the per-record time was in the call to seek(), so by reducing that time, I sped up my run times significantly.
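
A rough way to reproduce this kind of comparison, as a sketch only. The helper names time_skip() and compare_chunk_sizes() are hypothetical, the data is synthetic rather than my real inputs, and it assumes Python 3, where the gzip internals differ from 2.7, so the numbers may look quite different from the ones above.

```python
import gzip
import io
import time


def time_skip(compressed, target, chunk_size):
    """Return the seconds spent reading and discarding `target` uncompressed bytes."""
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
        start = time.perf_counter()
        remaining = target
        while remaining > 0:
            data = f.read(min(chunk_size, remaining))
            if not data:  # hit end of file early
                break
            remaining -= len(data)
        return time.perf_counter() - start


def compare_chunk_sizes(compressed, target, sizes=(1024, 16 * 8192), repeats=3):
    """Map each chunk size to its best (minimum) time over `repeats` runs."""
    return {
        size: min(time_skip(compressed, target, size) for _ in range(repeats))
        for size in sizes
    }


if __name__ == "__main__":
    blob = gzip.compress(bytes(range(256)) * 8192)  # 2 MiB of sample data
    for size, seconds in compare_chunk_sizes(blob, 1500000).items():
        print(f"chunk size {size:>6}: {seconds:.4f}s")
```

Taking the minimum over a few repeats reduces noise from other activity on the machine. It is not the same protocol as the runs described above.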

I'm still using 2.7, but other than the usual 2.x->3.x changes, the code looks pretty much the same between 2.7 and (at least) 3.3, and the logic involving the read size doesn't seem to have changed at all.

I'll try to produce a patch if I have a few minutes, but in the meantime, I've attached my modified GzipFile class (produced against 2.7).

History
Date	User	Action	Args
2014-03-17 18:58:47	skip.montanaro	set	recipients: + skip.montanaro
2014-03-17 18:58:46	skip.montanaro	set	messageid: <1395082726.87.0.730756156096.issue20962@psf.upfronthosting.co.za>
2014-03-17 18:58:46	skip.montanaro	link	issue20962 messages
2014-03-17 18:58:46	skip.montanaro	create