Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZipFile unzip is unbuffered #54585

Closed
Jimbofbx mannequin opened this issue Nov 9, 2010 · 12 comments
Closed

ZipFile unzip is unbuffered #54585

Jimbofbx mannequin opened this issue Nov 9, 2010 · 12 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir topic-IO

Comments

@Jimbofbx
Copy link
Mannequin

Jimbofbx mannequin commented Nov 9, 2010

BPO 10376
Nosy @pitrou, @IIIIllllIIIIllllIIIIllllIIIIllllIIIIll, @serhiy-storchaka
Files
  • zipfiletest.py
  • zipfile_optimize_read.patch
  • zipfile_optimize_read.patch: regenerate patch for review (without manually deleted chunks)
  • zipfile_optimize_read_2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-06-23.14:51:21.731>
    created_at = <Date 2010-11-09.15:51:48.085>
    labels = ['library', 'expert-IO', 'performance']
    title = 'ZipFile unzip is unbuffered'
    updated_at = <Date 2012-06-23.14:51:21.730>
    user = 'https://bugs.python.org/Jimbofbx'

    bugs.python.org fields:

    activity = <Date 2012-06-23.14:51:21.730>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-06-23.14:51:21.731>
    closer = 'pitrou'
    components = ['Library (Lib)', 'IO']
    creation = <Date 2010-11-09.15:51:48.085>
    creator = 'Jimbofbx'
    dependencies = []
    files = ['25432', '25530', '25565', '25769']
    hgrepos = []
    issue_num = 10376
    keywords = ['patch']
    message_count = 12.0
    messages = ['120871', '120873', '159603', '159767', '160377', '160542', '161985', '162831', '163582', '163603', '163616', '163618']
    nosy_count = 7.0
    nosy_names = ['pitrou', 'nadeem.vawda', 'docs@python', 'Jimbofbx', 'xuanji', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue10376'
    versions = ['Python 3.3']

    @Jimbofbx
    Copy link
    Mannequin Author

    Jimbofbx mannequin commented Nov 9, 2010

    The Unzip module is always unbuffered (tested v.3.1.2 Windows XP, 32-bit). This means that if one has to do many small reads it is a lot slower than reading a chunk of data to a buffer and then reading from that buffer. It seems logical that the unzip module should default to buffered reading and/or have a buffered argument. Likewise, the documentation should clarify that there is no buffering involved when doing a read, which runs contrary to the default behavior of a normal read.

    start Zipfile read
    done
    27432 reads done
    took 0.859 seconds
    start buffered Zipfile read
    done
    27432 reads done
    took 0.072 seconds
    start normal read (default buffer)
    done
    27432 reads done
    took 0.139 seconds
    start buffered normal read
    done
    27432
    took 0.137 seconds

    @Jimbofbx Jimbofbx mannequin added the docs Documentation in the Doc dir label Nov 9, 2010
    @Jimbofbx Jimbofbx mannequin assigned docspython Nov 9, 2010
    @Jimbofbx Jimbofbx mannequin added stdlib Python modules in the Lib dir topic-IO performance Performance or resource usage labels Nov 9, 2010
    @Jimbofbx
    Copy link
    Mannequin Author

    Jimbofbx mannequin commented Nov 9, 2010

    I should clarify that this is the zipfile constructor I am using:

    zipfile.ZipFile(filename, mode='r', allowZip64=True);

    @serhiy-storchaka
    Copy link
    Member

    Actually reading from the zip file is buffered (at least 4 KiB of uncompressed data at a time).

    Can you give tests, scripts and data, which show the problem?

    @Jimbofbx
    Copy link
    Mannequin Author

    Jimbofbx mannequin commented May 1, 2012

    See attached, which will open a zipfile that contains one file and reads it a bunch of times using unbuffered and buffered idioms. This was tested on windows using python 3.2

    You're in charge of coming up with a file to test it on. Sorry.

    Example output:

    Enter filename: test.zip
    Timing unbuffered read, 5 bytes at a time. 10 loops
    took 6.671999931335449
    Timing buffered read, 5 bytes at a time (4000 byte buffer). 10 loops
    took 0.7350001335144043

    @serhiy-storchaka
    Copy link
    Member

    This is not because zipfile module is unbuffered. This is the difference between expensive function call and cheap bytes slicing. Replace zf.open(namelist [0]) to io.BufferedReader(zf.open(namelist [0])) to see the effect of a good buffering. In 3.2 zipfile read() implemented not optimal, so it slower (twice), but in 3.3 it will be almost as fast as using io.BufferedReader. It is still several times more slowly than bytes slicing, but there's nothing you can do with it.

    Here is a patch, which is speeds up (+20%) the reading from a zip file by small chunks. Microbenchmark:

    ./python -m zipfile -c test.zip python
    ./python -m timeit -n 1 -s "import zipfile;zf=zipfile.ZipFile('test.zip')" "with zf.open('python') as f:" " while f.read(1):pass"

    Python 3.3 (vanilla): 1 loops, best of 3: 36.4 sec per loop
    Python 3.3 (patched): 1 loops, best of 3: 30.1 sec per loop
    Python 3.3 (with io.BufferedReader): 1 loops, best of 3: 30.2 sec per loop
    And, for comparison, Python 3.2: 1 loops, best of 3: 74.5 sec per loop

    @serhiy-storchaka serhiy-storchaka removed the docs Documentation in the Doc dir label May 10, 2012
    @serhiy-storchaka
    Copy link
    Member

    Thank you, Martin, now I understood why not work Rietveld review.

    @serhiy-storchaka
    Copy link
    Member

    The patch updated to reflect Martin's stylistic comments.

    Sorry for the delay, Martin. I have not received an email with your review from 2012-05-13, and only today accidentally discovered your comments in Rietveld. It seems to have been some bug in Rietveld.

    @serhiy-storchaka
    Copy link
    Member

    Martin, now the patch is good?

    @serhiy-storchaka
    Copy link
    Member

    Any chance to commit the patch before final feature freeze?

    @nadeemvawda
    Copy link
    Mannequin

    nadeemvawda mannequin commented Jun 23, 2012

    Patch looks fine to me.

    Antoine, can you commit this? I'm currently away from the computer that
    has my SSH key on it.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jun 23, 2012

    New changeset 0e8285321659 by Antoine Pitrou in branch 'default':
    On behalf of Nadeem Vawda: issue bpo-10376: micro-optimize reading from a Zipfile.
    http://hg.python.org/cpython/rev/0e8285321659

    @pitrou
    Copy link
    Member

    pitrou commented Jun 23, 2012

    Antoine, can you commit this?

    Ok, done.

    @pitrou pitrou closed this as completed Jun 23, 2012
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage stdlib Python modules in the Lib dir topic-IO
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants