classification
Title: Performance for small reads and fix seek problem
Type: Stage: resolved
Components: Extension Modules Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, florianfesti, georg.brandl, karld, lucas_malor, nils, pitrou, rharris
Priority: normal Keywords: patch

Created on 2007-03-07 17:57 by florianfesti, last changed 2010-09-23 16:24 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
test_gzip2.py florianfesti, 2007-03-07 17:58 Performance test for gzip module
results.txt florianfesti, 2007-03-07 18:01 Performance comparison
test_gzip.py.diff florianfesti, 2007-03-07 18:24
test_gzip.py-noseek.diff florianfesti, 2007-03-09 09:24 Check if read and readline use anything else than read() on the input file
gzip.py.diff florianfesti, 2007-03-15 17:43 new read() readline() implementation for gzip
test_gzip2-py3.py florianfesti, 2010-09-18 10:14 Python3 performance test for gzip module
result-py3.txt florianfesti, 2010-09-18 10:20 Performance result for the stdlib gzip module (Python 3.1 on Fedora 14 x86_64)
0001-Avoid-the-need-of-seek-ing-on-the-file-read.patch florianfesti, 2010-09-22 10:09 temporary wrap fileobj with PaddedFile
0002-Avoid-the-need-of-seek-ing-on-the-file-read-2.patch florianfesti, 2010-09-22 10:10 Wrap fileobj all the time
Messages (23)
msg52086 - (view) Author: Florian Festi (florianfesti) Date: 2007-03-07 17:57
This patch improves read performance of the gzip module. We have seen improvements ranging from 20% when reading big blocks up to a factor of 50 when reading 4 byte blocks. Additionally, this patch removes the need to seek within the file, which allows using streams like stdin as input.

Details:

The whole read()/readline() stack got rewritten. The core part is a new input buffer. It consists of a list of strings (self.buffer), an offset recording how much of the first string has already been consumed (self.pos), and the length of the (still consumable) buffer (self.bufferlen). This makes adding to and removing from the buffer cheap. It turned out that removing from the old buffer was killing performance, as reading one byte required copying the whole buffer. Reading a 2k buffer in single bytes meant copying 2MB.

readline() no longer uses read() but works directly on the buffer. This removes a whole layer of string copying.
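The buffering scheme described above can be sketched roughly as follows. This is an illustration based on the description, not the patch's actual code; the class and method names are invented, only the attribute names (buffer, pos, bufferlen) mirror the message:

```python
class ChunkBuffer:
    """Sketch of the chunked input buffer: a list of byte strings plus
    an offset into the first chunk, so small reads never copy the
    whole buffer (only the bytes actually returned)."""

    def __init__(self):
        self.buffer = []      # list of pending chunks
        self.pos = 0          # bytes already consumed from buffer[0]
        self.bufferlen = 0    # total unconsumed bytes

    def add(self, data):
        # Appending a chunk is O(1); nothing is copied.
        self.buffer.append(data)
        self.bufferlen += len(data)

    def read(self, size):
        # Consume up to `size` bytes, dropping chunks only when they
        # are fully used up.
        chunks = []
        while size > 0 and self.buffer:
            chunk = self.buffer[0]
            avail = len(chunk) - self.pos
            take = min(size, avail)
            chunks.append(chunk[self.pos:self.pos + take])
            if take == avail:          # first chunk fully consumed
                self.buffer.pop(0)
                self.pos = 0
            else:
                self.pos += take
            size -= take
            self.bufferlen -= take
        return b"".join(chunks)

# Demo: two chunks, consumed across chunk boundaries.
buf = ChunkBuffer()
buf.add(b"hello")
buf.add(b"world")
```

The point of the offset is that a one-byte read only advances `pos` instead of rebuilding the remaining buffer, which is what made single-byte reads quadratic in the old implementation.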

To remove the need for seeking, a new read-only file-like class is used (PaddedFile). It simply prepends a string to a file and falls back to the file's read() method once the string is used up.
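The PaddedFile idea might be sketched like this (an illustration based on the description above, not the patch itself):

```python
import io

class PaddedFile:
    """Sketch of a read-only file-like object that prepends a byte
    string to an underlying file, so already-consumed header bytes
    can be "pushed back" without calling seek()."""

    def __init__(self, prepend, fileobj):
        self._prepend = prepend
        self._file = fileobj
        self._pos = 0          # position within the prepended bytes

    def read(self, size):
        if self._pos < len(self._prepend):
            # Serve from the prepended bytes first.
            chunk = self._prepend[self._pos:self._pos + size]
            self._pos += len(chunk)
            if len(chunk) < size:
                # Prepended bytes exhausted mid-read: top up from the file.
                chunk += self._file.read(size - len(chunk))
            return chunk
        return self._file.read(size)

# Demo: "push back" a 3-byte header onto a 5-byte stream.
pf = PaddedFile(b"abc", io.BytesIO(b"defgh"))
```

Because only read() is ever needed, this works on non-seekable inputs such as pipes or sockets.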

There is probably still some room for tweaking the buffer sizes, as we kept this simple. But the large performance improvements show that we can't be far off.

Performance test program and results are attached.
msg52087 - (view) Author: Florian Festi (florianfesti) Date: 2007-03-07 17:58
File Added: test_gzip2.py
msg52088 - (view) Author: Florian Festi (florianfesti) Date: 2007-03-07 18:01
File Added: results.txt
msg52089 - (view) Author: Florian Festi (florianfesti) Date: 2007-03-07 18:24
Added minor improvement of the unittest: Check for newlines in the readline() test code.
File Added: test_gzip.py.diff
msg52090 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-03-08 20:58
The patch looks good, and the tests pass. Can you add a test that ensures that a seek() method is not necessary any more for the underlying stream?

(closed #914340, which provided a similar patch, as a duplicate)
msg52091 - (view) Author: Florian Festi (florianfesti) Date: 2007-03-09 09:24
I added checks to test_readline() and test_append(). I left the other read test untouched to keep some filename=filename coverage.

BTW: I forgot to give special thanks to Phil Knirsch, who wrote the initial read() implementation and the performance test cases, did a lot of tweaking and testing, and without whom this patch would never have existed.
File Added: test_gzip.py-noseek.diff
msg52092 - (view) Author: Lucas Malor (lucas_malor) Date: 2007-03-15 16:42
Excuse me, but I can't apply the patch. I have Windows XP without any SP and I tried to run the command

patch -u gzip.py gzip.py.diff
msg52093 - (view) Author: Florian Festi (florianfesti) Date: 2007-03-15 17:43
Hm, works on Linux.
Try this one.
File Added: gzip.py.diff
msg52094 - (view) Author: Lucas Malor (lucas_malor) Date: 2007-03-15 18:30
I applied the patch by hand; I think the problem is simply that it's for Python 2.6 (I have the stable 2.5 version).

Anyway, as I wrote for an old similar patch by another user, the patch starts reading the header at the current position, not at the start of the file. You can see it by trying this piece of code:

---------------------------------------
import urllib2
import array
import gzip

urlfile = urllib2.urlopen(someurl)
header = array.array("B")
header.fromstring(urlfile.read(1))

gzfile = gzip.GzipFile(fileobj=urlfile)
print gzfile.read()
-----------------------------------------

Error:
------------------------------------------------------------------------------
  File "c:\My Programs\Python\lib\gzip.py", line 285, in read
    self._read_gzip_header(self.fileobj)
  File "c:\My Programs\Python\lib\gzip.py", line 177, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file
>Exit code: 1
------------------------------------------------------------------------------

I don't know how you can solve this without seek().

Anyway if you are interested I created the diff for Python 2.5 :

http://digilander.libero.it/LucasMalor/gzip_2.5.py.diff.gz
msg52095 - (view) Author: Florian Festi (florianfesti) Date: 2007-03-16 08:27
There is a simple solution for that problem: DON'T!

If you pass a file object to the GzipFile class, it is your responsibility to ensure that the file is a gzip file and that the file pointer is at the right position. After a short look into the original code, I don't think it can cope with the use case you brought up.

I'd even argue that seeking to the beginning of the file is broken behaviour. What if the gzip content doesn't start at the beginning of the file? In fact, prepending a header before compressed data is quite common. If the file you were reading really had a header, your example would have just worked.
msg52096 - (view) Author: Lucas Malor (lucas_malor) Date: 2007-03-16 09:06
Indeed, the problem is that seek() is not implemented for urllib objects. It's not a bug in your patch; sorry if I explained it badly.
msg96817 - (view) Author: Karl D'Adamo (karld) Date: 2009-12-22 21:21
This patch is awesome.  It makes it possible to do this with http
response objects that return gzipped streams:

>>> conn = httplib.HTTPConnection('blah')
>>> req = conn.request('GET', 'a.gz')
>>> resp = conn.getresponse()
>>> unzipper = gzip.GzipFile(fileobj=resp)
# read just the first 100 lines
>>> for i in range(100):
...     print unzipper.readline()
msg107987 - (view) Author: Rick Harris (rharris) Date: 2010-06-17 02:24
Are compatibility concerns the main reason this patch has yet to be applied?

If so, could we allay those fears by keeping the default seek-y behavior and only introducing the new seek-less style when a 'noseek' flag is passed?
msg107995 - (view) Author: Florian Festi (florianfesti) Date: 2010-06-17 08:25
There are no compatibility concerns I am aware of. The new implementation no longer needs a seek()able file. Of course, an implemented seek() method won't hurt anyone. The additional tests are only there to point out the problems of the old implementation.

So there is no flag needed to maintain compatibility. The patch just has to be reviewed and then to be applied. If there are any concerns or questions I'll be glad to assist.

Florian
msg108000 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-17 10:49
The patch could only be applied to 3.2 (2.7 is frozen now). But the gzip code has changed quite a bit, and I would advocate creating a new patch if you are interested.
Do notice that performance should also be much better in 3.2, and that it is possible to wrap a gzip object in an io.BufferedReader object so as to reach even better performance.
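The wrapping Antoine mentions could look like the following sketch (gzip.compress, available since Python 3.2, is used here purely to build test data in memory):

```python
import gzip
import io

# In-memory stand-in for a real gzip file.
data = gzip.compress(b"line one\nline two\n" * 1000)

# Wrap the GzipFile in an io.BufferedReader: repeated small readline()
# calls are then served from the buffer instead of hitting the
# decompression machinery on every call.
raw = gzip.GzipFile(fileobj=io.BytesIO(data))
buffered = io.BufferedReader(raw)

first = buffered.readline()
```

The same trick works for any file-like object whose per-call overhead dominates small reads.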
msg116622 - (view) Author: Mark Lawrence (BreamoreBoy) Date: 2010-09-16 22:01
As there has been a lot of support for the attached patch, what is needed to take this forward?
msg116627 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-16 22:08
> As there has been a lot of support for the attached patch, what is
> needed to take this forward?

Read my 2010-06-17 message above. Someone needs to update the patch for 3.2, and preferably show that it still brings an improvement (small micro-benchmarks are enough).
msg116759 - (view) Author: Florian Festi (florianfesti) Date: 2010-09-18 10:14
I updated the performance script to Py3. You still need to change the import gzipnew line to actually load the modified module. Right now it just compares the stdlib gzip module to itself.
msg116760 - (view) Author: Florian Festi (florianfesti) Date: 2010-09-18 10:20
Attached is the result of a run with the stdlib gzip module only. Results indicate that performance is still as bad as on Python 2. The Python 3 gzip module also still makes use of tell() and seek(). So both arguments for including this patch are still valid.

Porting the patch will involve some real work to get the bytes vs. string split right.
msg116763 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-18 11:35
> Attached is the result of a run with the stdlib gzip module only.
> Results indicate that performance is still as bad as on Python 2. The
> Python 3 gzip module also still makes use of tell() and seek(). So
> both arguments for including this patch are still valid.

Performance is easily improved by wrapping the file object in a
io.BufferedReader or io.BufferedWriter:

Text 1 byte block test
Original gzip Write:    2.125 s Read:    0.683 s
New gzip      Write:    0.390 s Read:    0.240 s

Text 4 byte block test
Original gzip Write:    1.077 s Read:    0.351 s
New gzip      Write:    0.204 s Read:    0.132 s

Text 16 byte block test
Original gzip Write:    1.119 s Read:    0.353 s
New gzip      Write:    0.264 s Read:    0.137 s

Still, fixing the seek()/tell() issue would be nice.
msg117131 - (view) Author: Florian Festi (florianfesti) Date: 2010-09-22 10:09
Stupid me! I ran the tests against my system's gzip version (Py 3.1). The performance issue is basically fixed by rev 77289. Performance is even a bit better than with my original patch, by maybe 10-20%. The only test case where it performs worse is:

Random 10485760 byte block test
Original gzip Write:   20.452 s Read:    2.931 s
New gzip      Write:   20.518 s Read:    1.247 s

I don't know if it is worth bothering. Maybe increasing the maximum chunk size would improve this, but I haven't tried that yet.

WRT seeking:

I now have two patches that eliminate the need for seek() in normal operation (rewind obviously still needs seek()). Both are based on the PaddedFile class. The first patch creates a PaddedFile object only while switching from an old member to a new one, while the second wraps the fileobj all the time. Performance tests show that wrapping is cheap. The first patch is a bit ugly, while the second requires an implementation of seek() and may create problems if new methods of the fileobj are added that interfere with the PaddedFile's internals.

So I leave the choice of which one is preferred to the module owner.

The patch creates another problem which is not yet fixed: the implementation of .seekable() becomes wrong. As one can now use non-seekable files, the implementation should check whether the file object used for reading is really seekable. As this is my first Py3k work, I'd prefer if this could be solved by someone else (but that should be pretty easy).
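A minimal sketch of the .seekable() fix described above (the class name is invented for illustration; the patch itself does this inside GzipFile): report seekability by delegating to the wrapped file object instead of assuming True.

```python
import io

class SeekCheckingWrapper:
    """Illustrative sketch: delegate seekable() to the underlying
    file object, so non-seekable inputs (pipes, sockets) correctly
    report False instead of an unconditional True."""

    def __init__(self, fileobj):
        self.fileobj = fileobj

    def seekable(self):
        # Objects without a seekable() method are treated as non-seekable.
        method = getattr(self.fileobj, "seekable", None)
        return method() if method is not None else False
```

With this approach, code that probes seekability before calling seek() (as io.BufferedReader does) gets an honest answer for stream inputs.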
msg117207 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-23 16:24
Thank you very much! I have kept the second approach (use PaddedFile at all times), since it is more regular and minimizes the probability of borderline cases.
As for the supposed performance slowdown, it doesn't seem significant. On large blocks of data, I expect that the compression/decompression cost will be overwhelming anyway.
I've added a test case and committed the patch in r84976. Don't hesitate to contribute again.
History
Date User Action Args
2010-09-23 16:24:49 pitrou set status: open -> closed
resolution: fixed
messages: + msg117207
stage: resolved
2010-09-22 10:10:02 florianfesti set files: + 0002-Avoid-the-need-of-seek-ing-on-the-file-read-2.patch
2010-09-22 10:09:29 florianfesti set files: + 0001-Avoid-the-need-of-seek-ing-on-the-file-read.patch
messages: + msg117131
2010-09-18 11:35:23 pitrou set messages: + msg116763
title: [gzip] Performance for small reads and fix seek problem -> Performance for small reads and fix seek problem
2010-09-18 10:20:05 florianfesti set files: + result-py3.txt
messages: + msg116760
2010-09-18 10:14:52 florianfesti set files: + test_gzip2-py3.py
messages: + msg116759
2010-09-16 22:08:08 pitrou set messages: + msg116627
2010-09-16 22:01:36 BreamoreBoy set nosy: + BreamoreBoy
messages: + msg116622
2010-06-17 10:49:12 pitrou set messages: + msg108000
versions: + Python 3.2, - Python 2.6
2010-06-17 08:54:17 nils set nosy: + nils
2010-06-17 08:25:33 florianfesti set messages: + msg107995
2010-06-17 02:24:10 rharris set nosy: + rharris
messages: + msg107987
2009-12-22 21:21:13 karld set nosy: + karld
messages: + msg96817
2009-03-31 19:35:37 georg.brandl link issue1243678 superseder
2009-01-18 12:17:57 pitrou set nosy: + pitrou
2007-03-07 17:57:08 florianfesti create