
Performance for small reads and fix seek problem #44677

Closed
florianfesti mannequin opened this issue Mar 7, 2007 · 23 comments
Labels
extension-modules C modules in the Modules dir

Comments

@florianfesti
Mannequin

florianfesti mannequin commented Mar 7, 2007

BPO 1675951
Nosy @birkenfeld, @pitrou, @nphilipp
Files
  • test_gzip2.py: Performance test for gzip module
  • results.txt: Performance comparison
  • test_gzip.py.diff
  • test_gzip.py-noseek.diff: Check if read and readline use anything else than read() on the input file
  • gzip.py.diff: new read() readline() implementation for gzip
  • test_gzip2-py3.py: Python3 performance test for gzip module
  • result-py3.txt: Performance result for the stdlib gzip module (Python 3.1 on Fedora 14 x86_64)
  • 0001-Avoid-the-need-of-seek-ing-on-the-file-read.patch: temporary wrap fileobj with PaddedFile
  • 0002-Avoid-the-need-of-seek-ing-on-the-file-read-2.patch: Wrap fileobj all the time
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-09-23.16:24:49.221>
    created_at = <Date 2007-03-07.17:57:08.000>
    labels = ['extension-modules']
    title = 'Performance for small reads and fix seek problem'
    updated_at = <Date 2010-09-23.16:24:49.220>
    user = 'https://bugs.python.org/florianfesti'

    bugs.python.org fields:

    activity = <Date 2010-09-23.16:24:49.220>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-09-23.16:24:49.221>
    closer = 'pitrou'
    components = ['Extension Modules']
    creation = <Date 2007-03-07.17:57:08.000>
    creator = 'florianfesti'
    dependencies = []
    files = ['7835', '7836', '7837', '7838', '7839', '18916', '18917', '18964', '18965']
    hgrepos = []
    issue_num = 1675951
    keywords = ['patch']
    message_count = 23.0
    messages = ['52086', '52087', '52088', '52089', '52090', '52091', '52092', '52093', '52094', '52095', '52096', '52097', '96817', '107987', '107995', '108000', '116622', '116627', '116759', '116760', '116763', '117131', '117207']
    nosy_count = 8.0
    nosy_names = ['georg.brandl', 'pitrou', 'lucas_malor', 'florianfesti', 'rharris', 'karld', 'nils', 'BreamoreBoy']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1675951'
    versions = ['Python 3.2']

    @florianfesti
    Mannequin Author

    florianfesti mannequin commented Mar 7, 2007

    This patch improves read performance of the gzip module. We have seen improvements from 20% when reading big blocks up to a factor of 50 when reading 4-byte blocks. Additionally, this patch removes the need to seek within the file, which allows using streams like stdin as input.

    Details:

    The whole read()/readline() stack got rewritten. The core part is a new input buffer. It consists of a list of strings (self.buffer), an offset recording how much has already been consumed from the first string (self.pos), and the length of the (still consumable) buffer (self.bufferlen). This makes adding to and removing from the buffer cheap. It turned out that removing from the old buffer was what killed performance: reading a single byte copied the whole remaining buffer, so reading a 2 kB buffer one byte at a time copied about 2 MB in total.
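
    In outline, the buffer behaves roughly like the following sketch (a simplified illustration with hypothetical names modelled on the description above; the real code is in the attached gzip.py.diff):

    class ChunkBuffer:
        """List-of-strings input buffer: appending and consuming are cheap
        because small reads only slice the first chunk instead of copying
        the whole remaining buffer."""

        def __init__(self):
            self.buffer = []      # pending chunks, oldest first
            self.pos = 0          # bytes already consumed from buffer[0]
            self.bufferlen = 0    # total unconsumed bytes

        def add(self, data):
            if data:
                self.buffer.append(data)
                self.bufferlen += len(data)

        def read(self, size):
            size = min(size, self.bufferlen)
            parts, needed = [], size
            while needed:
                chunk = self.buffer[0]
                avail = len(chunk) - self.pos
                if avail <= needed:
                    parts.append(chunk[self.pos:])
                    del self.buffer[0]
                    self.pos = 0
                    needed -= avail
                else:
                    parts.append(chunk[self.pos:self.pos + needed])
                    self.pos += needed
                    needed = 0
            self.bufferlen -= size
            return b"".join(parts)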

    readline() no longer uses read() but works directly on the buffer. This removes a whole layer of string copying.

    To remove the need for seeking, a new read-only file-like class (PaddedFile) is used. It simply prepends a string to a file and falls back to the file's read() method once the string has been consumed.
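
    A minimal sketch of such a wrapper (illustrative only; the actual PaddedFile is in the attached patch):

    class PaddedFile:
        """Read-only file-like object that prepends a string to fileobj,
        so already-consumed bytes can be handed back without seeking."""

        def __init__(self, fileobj, prepend=b''):
            self._file = fileobj
            self._buffer = prepend

        def read(self, size=-1):
            if not self._buffer:
                return self._file.read(size)
            if size < 0:
                data, self._buffer = self._buffer, b''
                return data + self._file.read()
            if size <= len(self._buffer):
                data = self._buffer[:size]
                self._buffer = self._buffer[size:]
                return data
            data, self._buffer = self._buffer, b''
            return data + self._file.read(size - len(data))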

    There is probably still room for tweaking the buffer sizes, as we kept this simple, but the large performance improvements show that we can't be far off.

    Performance test program and results are attached.

    @florianfesti florianfesti mannequin added extension-modules C modules in the Modules dir labels Mar 7, 2007
    @florianfesti
    Mannequin Author

    florianfesti mannequin commented Mar 7, 2007

    File Added: test_gzip2.py

    @florianfesti
    Mannequin Author

    florianfesti mannequin commented Mar 7, 2007

    File Added: results.txt

    @florianfesti
    Mannequin Author

    florianfesti mannequin commented Mar 7, 2007

    Added a minor improvement to the unit test: check for newlines in the readline() test code.
    File Added: test_gzip.py.diff

    @birkenfeld
    Member

    The patch looks good, and the tests pass. Can you add a test that ensures that a seek() method is not necessary any more for the underlying stream?

    (Closed bpo-914340, which provided a similar patch, as a duplicate.)

    @florianfesti
    Mannequin Author

    florianfesti mannequin commented Mar 9, 2007

    I added checks to test_readline() and test_append(). I left the other read test untouched to keep some filename=filename coverage.

    BTW: I forgot to give special thanks to Phil Knirsch, who wrote the initial read() implementation and the performance test cases, did a lot of tweaking and testing, and without whom this patch would never have existed.
    File Added: test_gzip.py-noseek.diff

    @lucasmalor
    Mannequin

    lucasmalor mannequin commented Mar 15, 2007

    Excuse me, but I can't apply the patch. I have Windows XP without any SP, and I tried to run the command

    patch -u gzip.py gzip.py.diff

    @florianfesti
    Mannequin Author

    florianfesti mannequin commented Mar 15, 2007

    Hm, works on Linux.
    Try this one:
    File Added: gzip.py.diff

    @lucasmalor
    Mannequin

    lucasmalor mannequin commented Mar 15, 2007

    I applied the patch by hand; I think the problem is simply that it's for Python 2.6 (I have the stable 2.5 version).

    Anyway, as I wrote for an older similar patch by another user, the patch starts reading the header at the current position rather than at the start of the file. You can see it by trying this piece of code:

    ---------------------------------------

    import urllib2
    import array
    import gzip
    
    urlfile = urllib2.urlopen(someurl)
    header = array.array("B")
    header.fromstring(urlfile.read(1))
    
    gzfile = gzip.GzipFile(fileobj=urlfile)
    print gzfile.read()

    Error:
    ------------------------------------------------------------------------------
    File "c:\My Programs\Python\lib\gzip.py", line 285, in read
    self._read_gzip_header(self.fileobj)
    File "c:\My Programs\Python\lib\gzip.py", line 177, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
    IOError: Not a gzipped file

    Exit code: 1
    ------------------------------------------------------------------------------

    I don't know how you can solve this without seek()

    Anyway, if you are interested, I created the diff for Python 2.5:

    http://digilander.libero.it/LucasMalor/gzip_2.5.py.diff.gz

    @florianfesti
    Mannequin Author

    florianfesti mannequin commented Mar 16, 2007

    There is a simple solution for that problem: DON'T!

    If you pass a file object to the GzipFile class, it is your responsibility to ensure that the file is a gzip file and that the file pointer is at the right position. After a short look at the original code, I don't think it can cope with the use case you brought up either.

    I'd even argue that seeking to the beginning of the file is broken behaviour. What if the gzip content doesn't start at the beginning of the file? In fact, prepending a header before compressed data is quite common. If the file you were reading really had such a header, your example would just have worked.
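
    For illustration (a hedged sketch; the path and offset are made up), reading gzip data that starts at some offset only requires positioning the file first:

    import gzip

    HEADER_SIZE = 128                     # hypothetical size of a custom header
    with open("archive.bin", "rb") as f:  # placeholder path
        f.seek(HEADER_SIZE)               # position at the start of the gzip data
        gz = gzip.GzipFile(fileobj=f)
        payload = gz.read()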

    @lucasmalor
    Mannequin

    lucasmalor mannequin commented Mar 16, 2007

    Indeed, the problem is that seek() is not implemented for urllib objects. It's not a bug in your patch; sorry if I explained it badly.


    @karld
    Mannequin

    karld mannequin commented Dec 22, 2009

    This patch is awesome. It makes it possible to do this with HTTP response objects that return gzipped streams:

    >>> import httplib, gzip
    >>> conn = httplib.HTTPConnection('blah')
    >>> conn.request('GET', 'a.gz')
    >>> resp = conn.getresponse()
    >>> unzipper = gzip.GzipFile(fileobj=resp)
    >>> # read just the first 100 lines
    >>> for i in range(100):
    ...     print unzipper.readline()

    @rharris
    Mannequin

    rharris mannequin commented Jun 17, 2010

    Are compatibility concerns the main reason this patch has yet to be applied?

    If so, could we allay those fears by keeping the default seek-y behavior and only introducing the new seek-less style when a 'noseek' flag is passed?

    @florianfesti
    Mannequin Author

    florianfesti mannequin commented Jun 17, 2010

    There are no compatibility concerns I am aware of. The new implementation no longer needs a seek()able file. Of course an implemented seek() method won't hurt anyone. The additional tests are only there to point out the problems of the old implementation.

    So there is no flag needed to maintain compatibility. The patch just has to be reviewed and then applied. If there are any concerns or questions, I'll be glad to assist.

    Florian

    @pitrou
    Member

    pitrou commented Jun 17, 2010

    The patch could only be applied to 3.2 (2.7 is frozen now). But the gzip code has changed quite a bit and I would advocate creating a new patch if you are interested.
    Do notice that performance should also be much better in 3.2, and it is possible to wrap a gzip object in an io.BufferedReader object to reach even better performance.
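
    For example, a reader can be wrapped like this (illustrative sketch; the file name is a placeholder):

    import gzip
    import io

    # Wrapping the GzipFile in a BufferedReader serves small reads from a
    # buffer instead of calling into the decompressor for every few bytes.
    raw = gzip.GzipFile("data.gz", "rb")   # placeholder path
    buffered = io.BufferedReader(raw)
    first_line = buffered.readline()
    buffered.close()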

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Sep 16, 2010

    As there has been a lot of support for the attached patch, what is needed to take this forward?

    @pitrou
    Member

    pitrou commented Sep 16, 2010

    As there has been a lot of support for the attached patch what is
    needed to take this forward?

    Read my 2010-06-17 message above. Someone needs to update the patch for 3.2, and preferably show that it still brings an improvement (small micro-benchmarks are enough).

    @florianfesti
    Mannequin Author

    florianfesti mannequin commented Sep 18, 2010

    I updated the performance script to Py3. You still need to change the "import gzipnew" line to actually load the modified module; right now it just compares the stdlib gzip module to itself.

    @florianfesti
    Mannequin Author

    florianfesti mannequin commented Sep 18, 2010

    Attached is the result of a run with the stdlib gzip module only. The results indicate that performance is still as bad as on Python 2. The Python 3 gzip module also still makes use of tell() and seek(). So both arguments for including this patch are still valid.

    Porting the patch will involve some real work to get the bytes vs. str split right.

    @pitrou
    Member

    pitrou commented Sep 18, 2010

    Attached is the result of a run with the stdlib gzip module only. The
    results indicate that performance is still as bad as on Python 2. The
    Python 3 gzip module also still makes use of tell() and seek(). So both
    arguments for including this patch are still valid.

    Performance is easily improved by wrapping the file object in an
    io.BufferedReader or io.BufferedWriter:

    Text 1 byte block test
    Original gzip Write: 2.125 s Read: 0.683 s
    New gzip Write: 0.390 s Read: 0.240 s

    Text 4 byte block test
    Original gzip Write: 1.077 s Read: 0.351 s
    New gzip Write: 0.204 s Read: 0.132 s

    Text 16 byte block test
    Original gzip Write: 1.119 s Read: 0.353 s
    New gzip Write: 0.264 s Read: 0.137 s

    Still, fixing the seek()/tell() issue would be nice.
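
    (For reference, a rough sketch of the write-side wrapping, assuming it is the GzipFile object itself that gets wrapped; the file name is a placeholder.)

    import gzip
    import io

    # Buffer small writes so they are coalesced before reaching the compressor.
    gz = gzip.GzipFile("out.gz", "wb")   # placeholder path
    buffered = io.BufferedWriter(gz)
    for _ in range(1000):
        buffered.write(b"abcd")          # 4-byte writes, batched by the buffer
    buffered.close()                     # flushes and closes the GzipFile too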

    @pitrou pitrou changed the title [gzip] Performance for small reads and fix seek problem Performance for small reads and fix seek problem Sep 18, 2010
    @florianfesti
    Mannequin Author

    florianfesti mannequin commented Sep 22, 2010

    Stupid me! I ran the tests against my system's gzip version (Py 3.1). The performance issue is basically fixed by rev 77289. Performance is even a bit better than with my original patch, by maybe 10-20%. The only test case where it performs worse is:

    Random 10485760 byte block test
    Original gzip Write: 20.452 s Read: 2.931 s
    New gzip Write: 20.518 s Read: 1.247 s

    I don't know if it is worth bothering. Maybe increasing the maximum chunk size improves this, but I haven't tried that out yet.

    WRT seeking:

    I now have two patches that eliminate the need for seek() during normal operation (rewind obviously still needs seek()). Both are based on the PaddedFile class. The first patch only creates a PaddedFile object while switching from an old member to a new one, while the second wraps the fileobj all the time. Performance tests show that wrapping is cheap. The first patch is a bit ugly, while the second requires an implementation of seek() and may create problems if new methods of the fileobj are used that interfere with PaddedFile's internals.

    So I leave the choice which one is preferred to the module owner.

    The patch creates another problem which is not yet fixed: the implementation of .seekable() becomes wrong. As one can now use non-seekable files, the implementation should check whether the file object used for reading is really seekable. As this is my first Py3k work, I'd prefer if this could be solved by someone else (but that should be pretty easy).
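
    A minimal sketch of what such a check could look like (hypothetical method body; the actual fix may differ):

    def seekable(self):
        # Reading is only seekable if the underlying file object is,
        # since rewind() has to seek backwards on it.
        if self.mode == READ:
            return self.fileobj is not None and self.fileobj.seekable()
        # Write mode only ever "seeks" forward by padding with zeroes.
        return True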

    @pitrou
    Member

    pitrou commented Sep 23, 2010

    Thank you very much! I have kept the second approach (use PaddedFile at all times), since it is more regular and minimizes the probability of borderline cases.
    As for the supposed performance slowdown, it doesn't seem significant. On large blocks of data, I expect that compression/decompression cost will be overwhelming anyway.
    I've added a test case and committed the patch in r84976. Don't hesitate to contribute again.

    @pitrou pitrou closed this as completed Sep 23, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022