Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for bzip2 compression to the zipfile module #58579

Closed
serhiy-storchaka opened this issue Mar 20, 2012 · 17 comments
Closed

Add support for bzip2 compression to the zipfile module #58579

serhiy-storchaka opened this issue Mar 20, 2012 · 17 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@serhiy-storchaka
Copy link
Member

BPO 14371
Nosy @loewis, @jcea, @pitrou, @serhiy-storchaka
Files
  • bzip2_in_zip.patch
  • bzip2_in_zip_review.patch: Regenerate patch with cpython base revision
  • bzip2_in_zip_tests.patch
  • bzip2_in_zip_2.patch
  • bzip2_in_zip_3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-05-01.05:59:21.119>
    created_at = <Date 2012-03-20.11:30:40.849>
    labels = ['type-feature', 'library']
    title = 'Add support for bzip2 compression to the zipfile module'
    updated_at = <Date 2012-05-01.05:59:21.118>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2012-05-01.05:59:21.118>
    actor = 'loewis'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-05-01.05:59:21.119>
    closer = 'loewis'
    components = ['Library (Lib)']
    creation = <Date 2012-03-20.11:30:40.849>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['24956', '24964', '24982', '24996', '25006']
    hgrepos = []
    issue_num = 14371
    keywords = ['patch']
    message_count = 17.0
    messages = ['156394', '156419', '156427', '156436', '156445', '156482', '156594', '156643', '156646', '156647', '156670', '158564', '158618', '158741', '158742', '159743', '159744']
    nosy_count = 7.0
    nosy_names = ['loewis', 'jcea', 'alanmcintyre', 'pitrou', 'nadeem.vawda', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue14371'
    versions = ['Python 3.3']

    @serhiy-storchaka
    Copy link
    Member Author

    ZIP File Format Specification (http://www.pkware.com/documents/casestudies/APPNOTE.TXT) supports bzip2 compression since at least 2003. Since bzip2 contained in Python standart library, it would be nice to add support for these method in zipfile. This will allow to process more foreign zip files and create more compact distributives.

    The proposed patch adds new method ZIP_BZIP2, which is automatically detecting when unpacking and that can be used for packing.

    @serhiy-storchaka serhiy-storchaka added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Mar 20, 2012
    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 20, 2012

    Can you please submit a contributor form?

    http://python.org/psf/contrib/contrib-form/
    http://python.org/psf/contrib/

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 20, 2012

    The patch looks good. Can you also provide a test case?

    @serhiy-storchaka
    Copy link
    Member Author

    I am working on this. Should I add tests to test_zipfile.py or create new test_zipfile_bzip2.py?

    It would add a note that the bzip2 compression can understand not all programs (and do not understand the older versions of Python), but understands the Info-Unzip? My English is not enough for the documentation.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Mar 20, 2012

    Please add it to test_zipfile.

    As for the documentation, I propose the wording

    "bzip2 compression was added to the zip file format in 2001. However, even more recent tools (including older Python releases) may not support it, causing either refusal to process the zip file altogether, or faiilure to extract individual files."

    I'm not a native speaker of English, either. Feel free to put things through Google translate; some native speaker will pick up the text and correct it.

    @serhiy-storchaka
    Copy link
    Member Author

    Thanks to the tests, I found the error. Since the bzip2 is block algorithm, decompressor need to eat a certain amount of data, so it began to return data. Now when reading small chunks turns out premature end of data. I'm working on a fix.

    @serhiy-storchaka
    Copy link
    Member Author

    All errors are fixed. All tests are passed. Unfortunately, the patch was more than expected. This is necessary for correct and effective work with large bzip2 buffers (for other codecs can also be a profit).

    @nadeemvawda
    Copy link
    Mannequin

    nadeemvawda mannequin commented Mar 23, 2012

    [Adding Alan McIntyre, who is listed as zipfile's maintainer.]

    I haven't yet had a chance to properly familiarize myself with the
    zipfile module, but I did notice an issue in the changes to ZipExtFile's
    read() method. The existing code uses the b"".join() idiom for linear-
    time concatenation, but the patch replaces it with a version that does
    "buf += data" after each read. CPython can (I think) do this efficiently,
    but it can be much slower on other implementations.

    Martin:

    As for the documentation, I propose the wording

    "bzip2 compression was added to the zip file format in 2001. However, even more recent tools (including older Python releases) may not support it, causing either refusal to process the zip file altogether, or faiilure to extract individual files."

    How about this?

    "The zip format specification has included support for bzip2 compression
    since 2001. However, some tools (including older Python releases) do not
    support it, and may either refuse to process the zip file altogether, or
    fail to extract individual files."

    @serhiy-storchaka
    Copy link
    Member Author

    The existing code uses the b"".join() idiom for linear-
    time concatenation, but the patch replaces it with a version that does
    "buf += data" after each read.

    You made a mess. The existing code uses buf += data, but I allowed myself
    to replace it with the b"".join() idiom. The bzip2 codec has to deal with
    large pieces of data, now this may be important. In read1 still used buf += data, but not in loop, there is a concatenation of the only two pieces.

    "The zip format specification has included support for bzip2 compression

    Thank you. Can you offer the variant with including both bzip2 and lzma
    (supported since 2006)? I put him in the upcoming patch that adds support for
    lzma compression to the zipfile module.

    @nadeemvawda
    Copy link
    Mannequin

    nadeemvawda mannequin commented Mar 23, 2012

    You made a mess. The existing code uses buf += data, but I allowed myself
    to replace it with the b"".join() idiom. The bzip2 codec has to deal with
    large pieces of data, now this may be important. In read1 still used buf += data, but not in loop, there is a concatenation of the only two pieces.

    My mistake; I confused the bodies of read() and read1().

    Thank you. Can you offer the variant with including both bzip2 and lzma
    (supported since 2006)? I put him in the upcoming patch that adds support for
    lzma compression to the zipfile module.

    "The zip format specification has included support for bzip2 compression
    since 2001, and for LZMA compression since 2006. However, some tools
    (including older Python releases) do not support these compression
    methods, and may either refuse to process the zip file altogether,
    or fail to extract individual files."

    @serhiy-storchaka
    Copy link
    Member Author

    Fixed regeression in decompression.

    Nadeem Vawda, we both were wrong. buf += data is noticeably faster b''.join() in CPython.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Apr 17, 2012

    What's the status of your contrib form?

    @pitrou
    Copy link
    Member

    pitrou commented Apr 18, 2012

    buf += data is noticeably faster b''.join() in CPython.

    Perhaps because your system's memory allocator is extremely good (or buf is always very small), but b''.join() is far more robust.
    Another alternative is accumulating in a bytearray, since it uses overallocation for linear time appending.

    @serhiy-storchaka
    Copy link
    Member Author

    What's the status of your contrib form?

    Oops. I put this off for a detailed study and forgotten.

    I will send the form, as only get to the printer and the scanner.

    @serhiy-storchaka
    Copy link
    Member Author

    Perhaps because your system's memory allocator is extremely good (or buf is always very small), but b''.join() is far more robust.
    Another alternative is accumulating in a bytearray, since it uses overallocation for linear time appending.

    I thought, that it was in special optimization, mentioned in the
    python-dev, but could not find this in the code. Perhaps it had not been
    implemented.

    In this particular case, the bytes appending is performed only once (and
    probably a lot of appending with b''). Exceptions are possible only in
    pathological cases, for example when compressed data is much larger
    uncompressed data. The current implementation uses buf += data, if
    someone wants to change it, then it's not me.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 1, 2012

    New changeset 028e8e0b03e8 by Martin v. Löwis in branch 'default':
    Issue bpo-14371: Support bzip2 in zipfile module.
    http://hg.python.org/cpython/rev/028e8e0b03e8

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 1, 2012

    Thanks for the patch!

    @loewis loewis mannequin closed this as completed May 1, 2012
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants