GZipFile.readline too slow #51720

Closed
asnakelover mannequin opened this issue Dec 10, 2009 · 27 comments
Labels: performance (Performance or resource usage), stdlib (Python modules in the Lib dir)

Comments


asnakelover mannequin commented Dec 10, 2009

BPO 7471
Nosy @pitrou, @jackdied, @briancurtin
Files
  • gzip_7471_patch.diff: Patch for Python 3.2
  • gzip_7471_io_patch.diff: Patch for Python 2.7 using the io module.
  • gzip_7471_py27.diff: Combined patch for Python 2.7
  • gzip_7471_py32.diff: Combined patch for Python 3.2
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-01-03.22:40:04.182>
    created_at = <Date 2009-12-10.16:11:23.030>
    labels = ['library', 'performance']
    title = 'GZipFile.readline too slow'
    updated_at = <Date 2010-01-03.22:40:04.181>
    user = 'https://bugs.python.org/asnakelover'

    bugs.python.org fields:

    activity = <Date 2010-01-03.22:40:04.181>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-01-03.22:40:04.182>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2009-12-10.16:11:23.030>
    creator = 'asnakelover'
    dependencies = []
    files = ['15536', '15575', '15619', '15620']
    hgrepos = []
    issue_num = 7471
    keywords = ['patch']
    message_count = 27.0
    messages = ['96204', '96205', '96206', '96217', '96219', '96220', '96221', '96222', '96234', '96272', '96326', '96409', '96412', '96413', '96416', '96487', '96537', '96538', '96539', '96554', '96565', '96566', '96643', '96650', '96662', '96663', '97182']
    nosy_count = 5.0
    nosy_names = ['pitrou', 'jackdied', 'nirai', 'brian.curtin', 'asnakelover']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue7471'
    versions = ['Python 2.7', 'Python 3.2']


    asnakelover mannequin commented Dec 10, 2009

    It's not a big problem because we can just shell out to zcat, but compare
    these two:

    $ time python ./b.py >/dev/null

    real 0m10.977s
    user 0m7.128s
    sys 0m0.888s
    $ time python ./a.py >/dev/null

    real 1m19.015s
    user 1m18.185s
    sys 0m0.072s

    $ # Notice that the gzip module (a.py) had the benefit of the files
    being in a disk cache by now, too...
    
    $ cat a.py 
    import gzip
    import os

    apt_cache_dir = "/var/cache/apt/apt-file"

    for apt_cache_file in os.listdir(apt_cache_dir):
        if not apt_cache_file.endswith(".gz"):
            continue
        f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
        for line in f:
            print line
    $ cat b.py 
    import os
    import subprocess
    from cStringIO import StringIO

    apt_cache_dir = "/var/cache/apt/apt-file"

    for apt_cache_file in os.listdir(apt_cache_dir):
        if not apt_cache_file.endswith(".gz"):
            continue
        p = subprocess.Popen(
            ["zcat", os.path.join(apt_cache_dir, apt_cache_file)],
            stdout=subprocess.PIPE)
        f = StringIO(p.communicate()[0])
        assert p.returncode == 0 
        for line in f:
            print line

    Also tried this one just for "completeness":

    $ cat c.py 
    import gzip
    import os
    from cStringIO import StringIO

    apt_cache_dir = "/var/cache/apt/apt-file"

    for apt_cache_file in os.listdir(apt_cache_dir):
        if not apt_cache_file.endswith(".gz"):
            continue
        f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
        f = StringIO(f.read())
        for line in f:
            print line

    But after it had run (with some thrashing) for 3 and a half minutes, I
    killed it.

    asnakelover added the stdlib and performance labels Dec 10, 2009

    pitrou commented Dec 10, 2009

    How does the following compare?

    f = gzip.open(...)
    s = f.read()
    for line in s.splitlines():
        print line


    pitrou commented Dec 10, 2009

    (GZipFile.readline() is implemented in pure Python, which explains why
    it's rather slow)

    pitrou changed the title from "gzip module too slow" to "GZipFile.readline too slow" Dec 10, 2009

    asnakelover mannequin commented Dec 10, 2009

    Hope this reply works right; the Python bug interface is a bit confusing
    for this newbie - it doesn't say "Reply" anywhere - so sorry if it goes FUBAR.

    I tried the splitlines() version you suggested; it thrashed my machine
    so badly I pressed alt+sysrq+f (which invokes kernel oom_kill) after
    about 1 minute so I didn't lose anything important. About half a minute
    later the machine came back to life. In other words: the splitlines
    version used way, way too much memory - far worse even than making a
    cStringIO from a GzipFile instance.read().

    It's not just a GzipFile.readline() issue either: c.py calls .read() and
    tries to turn the result into a cStringIO, and that was the worst one of
    my three previous tests. I'm going to look at this purely from a
    consumer angle and not even look at the gzip module source. From this
    angle (a consumer's), zcat outperforms it by a factor of 10 when the
    gzip module is used with .readline(), and by a good deal more when I try
    to read the whole gzip file as a string to turn into a cStringIO, to
    emulate as closely as possible what happens with forking a zcat process.
    When I tried splitlines() it was even worse. This is probably a RAM
    issue, but it just brings us back to: should the gzip module eat so much
    RAM when shelling out to zcat uses far less?


    pitrou commented Dec 10, 2009

    > I tried the splitlines() version you suggested; it thrashed my machine
    > so badly I pressed alt+sysrq+f (which invokes kernel oom_kill) after
    > about 1 minute so I didn't lose anything important.

    This sounds very weird. How much memory do you have, and how large are
    the .gz files you are reading? What is your system exactly?


    asnakelover mannequin commented Dec 10, 2009

    The gz in question is 17 MB compressed and 247 MB uncompressed. Calling
    zcat, the Python process uses between 250 and 260 MB with the whole
    string in memory (using zcat as a fork). Numbers for the gzip module
    aren't obtainable except for readline(), which doesn't use much memory
    but is very slow. Other methods thrash the machine to death.

    The machine has 300 MB free RAM from a total of 1024 MB.

    It's not a new issue, I didn't find it when searching python bug
    archives for some reason but google:

    http://www.google.com/search?q=python+gzip+slow+zcat

    It's an issue which people have stumbled into before, but nobody seems to
    have made any headway (I expect they just forked and called zcat or
    gzip, like I did...).

    It's a minor point since it's so easy to work around; in fact I have my
    IDS system completed already - thank you so much, Python, for being so
    RAD friendly - yet, as it stands, the module is worthless if a user has
    the zcat/gzip C programs and knows they exist.


    pitrou commented Dec 10, 2009

    > The gz in question is 17 MB compressed and 247 MB uncompressed. Calling
    > zcat, the Python process uses between 250 and 260 MB with the whole
    > string in memory (using zcat as a fork). Numbers for the gzip module
    > aren't obtainable except for readline(), which doesn't use much memory
    > but is very slow. Other methods thrash the machine to death.
    >
    > The machine has 300 MB free RAM from a total of 1024 MB.

    That would be the explanation. Reading the whole file at once and then
    doing splitlines() on the result consumes twice the memory, since a list
    of lines must be constructed while the original data is still around. If
    you had more than 600 MB free RAM the splitlines() solution would
    probably be adequate :-)

    Doing repeated calls to splitlines() on chunks of limited size (say 1MB)
    would probably be fast enough without using too much memory. It would be
    a bit less trivial to implement though, and it seems you are ok with the
    subprocess solution.
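
    A rough sketch of that chunked approach, purely for illustration (the 1 MB
    chunk size and the handling of a partial trailing line are assumptions, not
    taken from any patch here):

    import gzip

    CHUNK = 1024 * 1024  # read ~1 MB of decompressed data at a time

    def iter_lines(path):
        f = gzip.open(path)
        pending = ''
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            pending += data
            lines = pending.splitlines(True)   # keep the line endings
            # Hold back a possibly incomplete last line for the next chunk.
            if lines and not lines[-1].endswith('\n'):
                pending = lines.pop()
            else:
                pending = ''
            for line in lines:
                yield line
        if pending:
            yield pending
        f.close()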


    asnakelover mannequin commented Dec 10, 2009

    Yes, subprocess works fine and was the quickest to implement and
    probably the fastest to run too.

    How can I put this without being an ass? Hell, I'm no good at diplomacy

    • the gzip module blows chunks - if I can shell out to a standard Unix
      util and it uses a tiny fraction of the memory to do the same job, the
      module is inherently broken no matter how pretty its code looks.

    You can of course tell me to STFU - but look around the web; I won't be
    the first or last user to bemoan "import gzip"... :)


    pitrou commented Dec 11, 2009

    > How can I put this without being an ass? Hell, I'm no good at diplomacy
    >
    > • the gzip module blows chunks - if I can shell out to a standard Unix
    >   util and it uses a tiny fraction of the memory to do the same job, the
    >   module is inherently broken no matter how pretty its code looks.

    Well, let's say it is suboptimal. But it's certainly ok if you don't
    have any tight performance constraints.
    In any case, it won't improve until someone proposes a patch :-)

    jackdied (Contributor) commented:

    I tried passing a size to readline to see if increasing the chunk helps
    (the test file was 120 MB with 700k lines). For values of 1k-10k all took
    around 30 seconds, with a value of 100 it took 80 seconds, and with a
    value of 100k it ran for several minutes before I killed it. The default
    starts at 100 and quickly maxes out at 512, which seems to be a sweet spot
    (thanks to whoever figured that out!).

    I profiled it and function overhead seems to be the real killer. 30% of
    the time is spent in readline(). The next() function does almost
    nothing and consumes 1/4th the time of readline(). Ditto for read() and
    _unread(). Even lowly len() consumes 1/3rd the time of readline()
    because it is called over 2 million times.

    There doesn't seem to be any way to speed this up without rewriting the
    whole thing as a C module. I'm closing the bug as WONTFIX.


    nirai mannequin commented Dec 13, 2009

    First patch; please forgive the long comment :)

    I submit a small patch which speeds up readline() on my data set - a
    74MB (5MB .gz) log file with 600K lines.

    The speedup is 350%.

    The source of the slowness is that the (~20 KB) extrabuf is allocated and
    deallocated in read() and _unread() with each call to readline().

    In the patch read() returns a slice from extrabuf and defers
    manipulation of extrabuf to _read().

    In the following, the first timeit() corresponds to reading extrabuf
    slices while the second timeit() corresponds to read() and _unread() as
    they are done today:

    >>> timeit.Timer("x[10000: 10100]", "x = 'x' * 20000").timeit()
    0.25299811363220215

    >>> timeit.Timer("x[: 100]; x[100:]; x[100:] + x[: 100]", "x = 'x' * 10000").timeit()
    5.843876838684082

    Another speedup is achieved by doing a small shortcut in readline() for
    the typical case in which the entire line is already in extrabuf.

    The patch only addresses the typical case of calling readline() with no
    arguments. It does not address other problems in readline() logic. In
    particular, the current 512 chunk size is not a sweet spot. Regardless of
    the size argument passed to readline(), read() will continue to
    decompress just 1024 bytes with each call, as the size of extrabuf swings
    around the target size argument as a result of the interaction between
    _unread() and read().
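
    For illustration only, a minimal sketch of the slicing idea (the attribute
    names follow the extrabuf/extrastart/extrasize internals discussed in this
    thread; refilling the buffer via _read() when it runs short is elided, and
    the surrounding GzipFile class is assumed):

    def read(self, size):
        # Return a slice of the already-decompressed buffer instead of
        # slicing, re-joining and re-assigning it on every call.
        offset = self.offset - self.extrastart    # current position in extrabuf
        chunk = self.extrabuf[offset:offset + size]
        self.extrasize -= size
        self.offset += size
        return chunk                              # no _unread() round-trip needed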


    pitrou commented Dec 14, 2009

    Ah, a patch. Now we're talking :)

    pitrou reopened this Dec 14, 2009

    pitrou commented Dec 14, 2009

    The patch doesn't apply against the SVN trunk (some parts are rejected).
    I suppose it was done against 2.6 or earlier, but those versions are in
    bug fixing-only mode (which excludes performance improvements), so
    you'll have to regenerate it against the SVN trunk.

    (for info about our SVN repository, see
    http://www.python.org/dev/faq/#how-do-i-get-a-checkout-of-the-repository-read-only-and-read-write
    )


    pitrou commented Dec 14, 2009

    Ah, my bad, I hadn't seen that the patch is for 3.2. Sorry for the
    confusion.


    pitrou commented Dec 14, 2009

    I confirm that the patch gives good speedups. It would be nice if there
    was a comment explaining what extrabuf, extrastart and extrasize are.

    In 3.x, a better but more involved approach would be to rewrite the
    gzip module so as to take advantage of the standard IO classes
    (especially BufferedReader / BufferedWriter), which are written in C and
    are therefore much faster.


    nirai mannequin commented Dec 16, 2009

    Right, using the io module makes GzipFile as fast as zcat.

    I submit a new patch, this time for Python 2.7; however, it is not a
    module rewrite, but again minimal refactoring.

    GzipFile is now derived from io.BufferedRandom, and as a result the
    functionality of GzipFile has changed, since it used to behave differently
    from the io classes. For example, write() did not return the number of
    bytes written, etc. Also, I hacked it a little to make it work with a raw
    stream which is either readable or writable but not both. If this is
    unacceptable, or if you prefer a module rewrite, I don't mind discussing
    and evolving this further.


    pitrou commented Dec 17, 2009

    Thanks for the new patch. The problem with inheriting from
    BufferedRandom, though, is that if you call e.g. write() on a read-only
    gzipfile, it will appear to succeed because the bytes are buffered
    internally.

    I think the solution would be to use delegation rather than inheritance.
    Something like:

        def __init__(self, ...):
            if 'w' in mode:
                self.buf = BufferedWriter(...)
            for meth in ('read', 'write', etc.):
                setattr(self, meth, getattr(self.buf, meth))

    It would also be nice to add some tests for the issues I mentioned
    earlier (check that IOError is raised when reading a write-only file,
    and vice-versa).

    By the way, we can't apply this approach to 2.x since
    BufferedWriter/BufferedRandom won't accept unicode arguments for write()
    and friends, and that would break compatibility. For 3.x it is fine though.
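
    Such tests might look roughly like this (the file names and test-case names
    are placeholders, not taken from the patch):

    import gzip
    import unittest

    class TestGzipModeErrors(unittest.TestCase):
        def setUp(self):
            # Create a small gzip file so the read-only case has something to open.
            f = gzip.GzipFile("data.gz", "wb")
            f.write("some data\n")
            f.close()

        def test_read_on_write_only_file(self):
            # Reading a file opened for writing should raise IOError,
            # not silently return buffered data.
            f = gzip.GzipFile("data.gz", "wb")
            try:
                self.assertRaises(IOError, f.read)
            finally:
                f.close()

        def test_write_on_read_only_file(self):
            f = gzip.GzipFile("data.gz", "rb")
            try:
                self.assertRaises(IOError, f.write, "more data")
            finally:
                f.close()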


    nirai mannequin commented Dec 17, 2009

    How about using the first patch with the slicing optimization and
    additionally enhancing GzipFile with the methods required to make it
    play nice as a raw stream to an io.BufferedReader object (readable(),
    writable(), readinto(), etc...).

    This way we still get the 350% speed up and keep it fully backward
    compatible, and if someone needs the extra performance they can feed it
    into an io.BufferedReader object thus:

    g = gzip.GzipFile(...)
    r = io.BufferedReader(g)
    for line in r:
        ...


    pitrou commented Dec 17, 2009

    > How about using the first patch with the slicing optimization and
    > additionally enhancing GzipFile with the methods required to make it
    > play nice as a raw stream to an io.BufferedReader object (readable(),
    > writable(), readinto(), etc...).

    That's fine with me.


    nirai mannequin commented Dec 18, 2009

    Submitted a combined patch for Python 2.7.
    If it's good I'll send one for Python 3.2.

    briancurtin (Member) commented:

    In the test, should you verify that the correct data comes back from
    io.BufferedReader? After the r.close() on line 90 of test_gzip.py,
    adding something like "self.assertEqual("".join(lines), data1*50)" would
    do the trick.
    Looks good.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 18, 2009

    Two things:

    • since it implements common IO operations, the GzipFile class could
      inherit io.BufferedIOBase. It would also alleviate the need to
      reimplement readinto(): BufferedIOBase has a default implementation
      which should be sufficient.
    • rather than type(data) is memoryview, the preferred idiom is
      isinstance(data, memoryview).
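
    A minimal illustration of those two points, assuming mode constants that
    mirror the gzip module's internal READ/WRITE values (the class body is
    abbreviated and not taken from the patch):

    import io

    READ, WRITE = 1, 2   # assumed to mirror the gzip module's mode constants

    class GzipReaderWriterSketch(io.BufferedIOBase):
        # With io.BufferedIOBase as the base class, readinto() need not be
        # reimplemented: the base class provides one built on top of read().

        def __init__(self, mode):
            self.mode = mode

        def readable(self):
            return self.mode == READ

        def writable(self):
            return self.mode == WRITE

        def write(self, data):
            # Preferred over: type(data) is memoryview
            if isinstance(data, memoryview):
                data = data.tobytes()
            # ... compress `data` and write it to the underlying fileobj ...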


    nirai mannequin commented Dec 19, 2009

    isatty() and __iter__() of io.BufferedIOBase raise on a closed file, and
    __enter__() raises ValueError with a different (generic) message.

    Should we keep the original GzipFile methods or prefer the implementation
    of io.BufferedIOBase?


    pitrou commented Dec 19, 2009

    > isatty() and __iter__() of io.BufferedIOBase raise on a closed file, and
    > __enter__() raises ValueError with a different (generic) message.
    >
    > Should we keep the original GzipFile methods or prefer the implementation
    > of io.BufferedIOBase?

    It's fine to use the BufferedIOBase implementation. There's no reason to
    call isatty() on or iterate over a closed file.


    nirai mannequin commented Dec 20, 2009

    Uploaded updated patch for Python 2.7.


    nirai mannequin commented Dec 20, 2009

    Uploaded patch for Python 3.2.


    pitrou commented Jan 3, 2010

    The patches have been committed. Thank you!

    pitrou closed this as completed Jan 3, 2010
    ezio-melotti transferred this issue from another repository Apr 10, 2022