Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support xz compression in tarfile module #49939

Closed
doko42 opened this issue Apr 4, 2009 · 38 comments
Closed

Support xz compression in tarfile module #49939

doko42 opened this issue Apr 4, 2009 · 38 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@doko42
Copy link
Member

doko42 commented Apr 4, 2009

BPO 5689
Nosy @loewis, @birkenfeld, @doko42, @pfmoore, @amauryfa, @gustaebel, @pitrou, @vstinner, @merwok
Dependencies
  • bpo-6715: xz compressor support
  • Files
  • 2011-09-15-tarfile-lzma.diff
  • 2011-12-08-tarfile-lzma.diff
  • lzma-preset.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/gustaebel'
    closed_at = <Date 2012-03-06.11:32:17.456>
    created_at = <Date 2009-04-04.16:08:34.510>
    labels = ['type-feature', 'library']
    title = 'Support xz compression in tarfile module'
    updated_at = <Date 2012-03-06.11:32:17.455>
    user = 'https://github.com/doko42'

    bugs.python.org fields:

    activity = <Date 2012-03-06.11:32:17.455>
    actor = 'nadeem.vawda'
    assignee = 'lars.gustaebel'
    closed = True
    closed_date = <Date 2012-03-06.11:32:17.456>
    closer = 'nadeem.vawda'
    components = ['Library (Lib)']
    creation = <Date 2009-04-04.16:08:34.510>
    creator = 'doko'
    dependencies = ['6715']
    files = ['23162', '23880', '24086']
    hgrepos = []
    issue_num = 5689
    keywords = ['patch']
    message_count = 38.0
    messages = ['85403', '85424', '85468', '86197', '86205', '106425', '106426', '106469', '106482', '106517', '106535', '106553', '106566', '106573', '144084', '148661', '148716', '148746', '148755', '149020', '149021', '149040', '149180', '149182', '149184', '149325', '149331', '149807', '150085', '150088', '150111', '150119', '150161', '150165', '150167', '150184', '151485', '151535']
    nosy_count = 16.0
    nosy_names = ['loewis', 'georg.brandl', 'doko', 'paul.moore', 'amaury.forgeotdarc', 'lars.gustaebel', 'pitrou', 'vstinner', 'koen', 'nadeem.vawda', 'eric.araujo', 'v+python', 'proyvind', 'itkach', 'eysispeisi', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue5689'
    versions = ['Python 3.3']

    @doko42
    Copy link
    Member Author

    doko42 commented Apr 4, 2009

    GNU tar now supports lzma compression as a compression method. Please
    consider adding lzma support to the tarfile module (either by using the
    external lzma program or by adding a lzma extension to the standard
    library).

    lzma extension at http://svn.fancycode.com/repos/python/pylzma/trunk/

    lzma is used in many tools (7zip, dpkg, rpm), offers faster
    decompression than bzip2, slower compression than gzip and bzip2.

    @doko42 doko42 added extension-modules C modules in the Modules dir type-feature A feature request or enhancement labels Apr 4, 2009
    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Apr 4, 2009

    As for an lzma module - I would prefer one that isn't LGPL'ed. Instead,
    it should link against a system-provide lzma library (which then might
    or might not licensed under lpgl). I would probably exclude the lzma
    module from Windows, as distributing the lzma sources along with the
    Python binaries is too painful.

    @birkenfeld
    Copy link
    Member

    If we support LZMA, we should do so on all platforms; it kind of
    restricts usefulness to only have it on some. Maybe the LZMA code in
    one of the many archival tools in existence that supports it is not LGPL'd?

    @koen
    Copy link
    Mannequin

    koen mannequin commented Apr 20, 2009

    The LZMA implementation from 7-zip has been released as public domain
    (since version 4.62 / Nov 2008) in the LZMA SDK: http://www.7-zip.org/
    sdk.html
    So, there shouldn't be a license issue for Windows. I am not sure if
    there are already system-provided LZMA libraries on Linux at this time.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Apr 20, 2009

    The LZMA implementation from 7-zip has been released as public domain
    (since version 4.62 / Nov 2008) in the LZMA SDK: http://www.7-zip.org/
    sdk.html

    That's good news. Now, if somebody could contribute a Python wrapper for
    these...

    So, there shouldn't be a license issue for Windows. I am not sure if
    there are already system-provided LZMA libraries on Linux at this time.

    There are. The Linux version apparently originates from the same
    sources, so they might be API compatible. However, I wouldn't mind
    if we extracted the entire lzma library from 7zip, and put it into
    the source distribution.

    @loewis loewis mannequin changed the title please support lzma compression as an extension and in the tarfile module please support lzma compression as an extension and in the tarfile module Apr 20, 2009
    @proyvind
    Copy link
    Mannequin

    proyvind mannequin commented May 25, 2010

    I'm the author of the pyliblzma module, and if desired, I'd be happy to help out adapting pyliblzma for inclusion with python.
    Most of it's code is based on bz2module.c, so it shouldn't be very far away from being good 'nuff.
    What I see as required is:

    • clean out use of C99 types etc.
    • clean up the LZMAOptions class (this is the biggest difference from the bz2 module, as the filter supports a wide range of various options, everything related such as parsing, api documentation etc. was placed in it's own class, I've yet to receive any feedback on this decission or find any remote equivalents out there to draw inspiration from;)
    • While most of the liblzma API has been implemented, support for multiple/alternate filters still remains to be implemented. When done it will also cause some breakage with the current pyliblzma API.

    I plan on doing these things sooner or later anyways, it's pretty much just a matter of motivation and priorities standing in the way, actual interest from others would certainly have a positive effect on this. ;)

    For other alternatives to the LGPL liblzma, you really don't have any, keep in mind that LZMA is "merely" the algorithm, while xz (and LZMA_alone, used for '.lzma', now obsolete, but still supported) are the actual format you want support for. The LZMA SDK does not provide any compatibility for this.

    @proyvind
    Copy link
    Mannequin

    proyvind mannequin commented May 25, 2010

    ps: pylzma uses the LZMA SDK, which is not what you want.
    pyliblzma (not the same module;) OTOH uses liblzma, which is the library used by xz/lzma utils

    You'll find it available at http://launchpad.net/pyliblzma

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 25, 2010

    For other alternatives to the LGPL liblzma, you really don't have
    any,

    If that's really the case (which I don't believe it is), then this
    project stops right here. If the underlying library is LGPL, it would
    require us to distribute its sources along with the Windows binaries,
    which I'm not willing to do.

    @koen
    Copy link
    Mannequin

    koen mannequin commented May 25, 2010

    The XZ Utils website ( http://tukaani.org/xz/ ) states the following:

    "The most interesting parts of XZ Utils (e.g. liblzma) are in the public domain. You can do whatever you want with the public domain parts.

    Some parts of XZ Utils (e.g. build system and some utilities) are under different free software licenses such as GNU LGPLv2.1, GNU GPLv2, or GNU GPLv3."

    So, liblzma is not the problem. But the license of PylibLZMA is LGPL3.

    @pitrou
    Copy link
    Member

    pitrou commented May 26, 2010

    If the underlying library is LGPL, it would
    require us to distribute its sources along with the Windows binaries,
    which I'm not willing to do.

    Martin, this is wrong, you don't have to bundle the source *in* the object code package. Making it available on some HTTP or FTP site is sufficient.
    (actually, if we don't modify the library source, we can even point at the original download site)

    @doko42
    Copy link
    Member Author

    doko42 commented May 26, 2010

    Per, on 2010-03-17, I asked you via email:

    "I was looking at

    http://bugs.python.org/issue5689
    http://bugs.python.org/issue6715

    and Martin's comments about the licensing of the bindings; is there a special reason for the lgpl3 license of the bindings, given that both python and xz-utils are not gpl'ed?"

    Does pyliblzma need to be licensed under the lgpl3?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 26, 2010

    Am 26.05.2010 10:51, schrieb Antoine Pitrou:

    Antoine Pitrou<pitrou@free.fr> added the comment:

    > If the underlying library is LGPL, it would
    > require us to distribute its sources along with the Windows binaries,
    > which I'm not willing to do.

    Martin, this is wrong, you don't have to bundle the source *in* the object code package.

    That's why I said "along". I'm still not willing to do that: making the
    source available is still inconvenient. More importantly, anybody
    redistributing Python binaries would have to comply also (e.g. on
    CD-ROMs or py2exe binaries); this is a burden I don't want to impose
    on our users. Fortunately, we don't have to, as the LZMA compression
    itself is in the public domain. For the Python wrapper, I hope that
    somebody contributes such a module under a PSF contributor agreement.
    If nobody else does, I may write one from scratch one day.

    @proyvind
    Copy link
    Mannequin

    proyvind mannequin commented May 26, 2010

    if you're already looking at bpo-6715, then I don't get why you're asking.. ;)

    quoting from msg106433:
    "For my code, feel free to use your own/any other license you'd like or even public domain (if the license of bz2module.c that much of it's derived from permits of course)!"

    The reason why I picked LGPLv3 in the past was simply just because liblzma at the time was licensed under it, so I just picked the same for simplicity.
    I've actually already dual-licensed it under the python license in addition on the project page though, but I just forgot updating the module's metadata as well before I released 0.5.3 last month..

    Martin: For LGPL (or even GPL for that matter, disregarding linking restrictions) libraries you don't have to distribute the sources of those libraries at all (they're already made available by others, so that would be quite overly redundant, uh?;). LGPL actually doesn't even care at all about the license of your software as long as you only dynamically link against it.

    I don't really get what the issue would be even if liblzma were still LGPL, it doesn't prohibit you from distributing a dynamically linked library along with python either if necessary (which of course would be of convenience on win32..)..

    tsktsk, discussions about python module for xz compression should anyways be kept at bpo-6715 as this one is about the tarfile module ;p

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 26, 2010

    tsktsk, discussions about python module for xz compression should
    anyways be kept at bpo-6715 as this one is about the tarfile module
    ;p

    Ok, following up there.

    @merwok merwok added stdlib Python modules in the Lib dir and removed extension-modules C modules in the Modules dir labels Sep 15, 2011
    @merwok merwok changed the title please support lzma compression as an extension and in the tarfile module Support xz compression in tarfile module Sep 15, 2011
    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Sep 15, 2011

    Attached is a patch with the current state of my work on lzma integration into tarfile (17 test errors).

    @gustaebel gustaebel mannequin self-assigned this Sep 15, 2011
    @merwok
    Copy link
    Member

    merwok commented Nov 30, 2011

    Python now has an lzma module. Lars, do you have the time to update your patch or should I do it?

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Dec 1, 2011

    I will be happy to, but my spare time is limited right now, so this could take about a week. If this is a problem, please go ahead.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Dec 1, 2011

    There is plenty of time until 3.3. OTOH, if Eric wants to work on it now: you got a week :-) Do recognize that there is a patch to start from already.

    @merwok
    Copy link
    Member

    merwok commented Dec 2, 2011

    I’m perfectly happy to let Lars do it next week or next month, there is no rush. The existing patch may even require very little or no change, as Nadeem’s module (in the stdlib) provides the same classes than the other lzma module which was used by the patch.

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Dec 8, 2011

    For those who want to test it first, I post the current state of the patch here. It is ready for commit, there are no failing tests. If nobody objects, I will apply it this weekend.

    @vstinner
    Copy link
    Member

    vstinner commented Dec 8, 2011

    Some comments about 2011-12-08-tarfile-lzma.diff:

    elif self.buf.startswith(b"\x5d\x00\x00\x80") or self.buf.startswith(b"...

    Micro-optimization: you can use self.buf.startswith((b"\x5d\x00\x00\x80", b"\xfd7zXZ")) here.

    raise ValueError("mode must be 'r' or 'w'.")

    Error messages usually don't end with a dot (or am I wrong?).

    It would be better to use a skip instead of just return here:

    def test_no_name_argument(self):
      if self.mode.endswith("bz2") or self.mode.endswith("xz"):
        # BZ2File and LZMAFile have no name attribute.
        return

    In _Stream.__init__, for zlib:

    self.exception = zlib.error

    Could you add a test for this change?

    @merwok
    Copy link
    Member

    merwok commented Dec 8, 2011

    Patch looks great. I did a review on Rietveld.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Dec 10, 2011

    New changeset 899a8c7b2310 by Lars Gustäbel in branch 'default':
    Issue bpo-5689: Add support for lzma compression to the tarfile module.
    http://hg.python.org/cpython/rev/899a8c7b2310

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Dec 10, 2011

    Thanks for the review, guys! I can't close this issue yet because it depends on bpo-6715.

    @nadeemvawda
    Copy link
    Mannequin

    nadeemvawda mannequin commented Dec 10, 2011

    Great stuff! I'll close this issue along with bpo-6715 once the buildbot
    stuff is all sorted out.

    @merwok
    Copy link
    Member

    merwok commented Dec 12, 2011

    Lars, as part of a small doc patch I want to change this in tarfile.rst:

    The :mod:`tarfile` module makes it possible to read and write tar
    archives, including those using gzip or bz2 compression.
    -(:file:`.zip` files can be read and written using the :mod:`zipfile` module.)
    +Use the :mod:`zipfile` module to read or write :file:`.zip` files, or the
    +higher-level functions in :ref:`shutil <archiving-operations>`.

    Any objection?

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Dec 12, 2011

    Please, go ahead!

    @vstinner
    Copy link
    Member

    There is failure on a XP buildbot. I don't know if it is a sporadic issue or not.

    http://www.python.org/dev/buildbot/all/builders/x86%20XP-5%203.x/builds/3921/steps/test/logs/stdio

    ======================================================================
    ERROR: test_append_lzma (test.test_tarfile.AppendTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\Buildslave\3.x.moore-windows\build\lib\test\test_tarfile.py", line 1539, in test_append_lzma
        self._create_testtar("w:xz")
      File "D:\Buildslave\3.x.moore-windows\build\lib\test\test_tarfile.py", line 1486, in _create_testtar
        with tarfile.open(self.tarname, mode) as tar:
      File "D:\Buildslave\3.x.moore-windows\build\lib\tarfile.py", line 1721, in open
        return func(name, filemode, fileobj, **kwargs)
      File "D:\Buildslave\3.x.moore-windows\build\lib\tarfile.py", line 1826, in xzopen
        mode=mode, fileobj=fileobj, preset=preset)
      File "D:\Buildslave\3.x.moore-windows\build\lib\lzma.py", line 117, in __init__
        preset=preset, filters=filters)
    MemoryError

    @pitrou
    Copy link
    Member

    pitrou commented Dec 22, 2011

    Perhaps Paul can try to reproduce and diagnose the issue directly on the buildbot?

    @pfmoore
    Copy link
    Member

    pfmoore commented Dec 22, 2011

    A simple rebuild and test run of that test in debug mode didn't fail...

    I'll run the full test suite as a check, but that could take some time - that buildslave isn't the fastest in the world...

    @nadeemvawda
    Copy link
    Mannequin

    nadeemvawda mannequin commented Dec 22, 2011

    Not to worry - as I said in my previous message, I can reproduce the error
    on my own XP machine.

    I also noticed that running test_tarfile alone doesn't trigger the errors,
    which leads me to suspect that the failure is due to some interaction with
    another test getting run before test_tarfile. I'm currently trying to
    determine what this test is.

    I suspect that the problem is at least partially caused by the fact that
    tarfile uses a default compresslevel of 9 for .tar.xz archives (rather
    than the recommended value of 6). According to the man page for the xz
    tool <http://manpages.ubuntu.com/manpages/lucid/man1/xz.1.html\>, using a
    compresslevel of 9 can result in memory usage of up to 800MB during
    compression, which is a significant fraction of the bot's 2GB of RAM.
    (I suppose it would be a good idea to mention this in the documentation
    for the lzma module, so users won't get bitten by this...)

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Dec 23, 2011

    Wouldn't it be better then to use a default compresslevel of 6 in tarfile? I used level 9 in my patch without a particular reason, just because I thought 9 must be better than 6 ;-)

    @nadeemvawda
    Copy link
    Mannequin

    nadeemvawda mannequin commented Dec 23, 2011

    Yes, that's a good idea. I've been testing a similar change, and it seems
    to drop the peak memory usage for test_tarfile from around 810MB down to
    under 200MB. It looks like 2GB genuinely isn't enough to reliably use LZMA
    compression with preset=9.

    You might want to use preset=None instead of explicitly saying preset=6,
    though. This tells LZMAFile to use the default preset, and will allow you
    to get rid of the if-statement on lines 1821-1823.

    Something unrelated that I noticed in the surrounding code: gzopen and
    bz2open validate the mode by testing 'len(mode) > 1 or mode not in "rw"'.
    This would be simpler as 'mode not in ("r", "w")' (like you've done in
    xzopen), and it would accept only "r" and "w" (but not "" or "rw").

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Dec 23, 2011

    Yes, that's much better. Thanks for the tip.

    @nadeemvawda
    Copy link
    Mannequin

    nadeemvawda mannequin commented Dec 23, 2011

    Patch looks good to me.

    @amauryfa
    Copy link
    Member

    Ping. Windows buildbots are still failing with MemoryError because of this preset=9.
    The patch looks good to me as well.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jan 18, 2012

    New changeset b86b54fcb5c2 by Lars Gustäbel in branch 'default':
    Issue bpo-5689: Avoid excessive memory usage by using the default lzma preset.
    http://hg.python.org/cpython/rev/b86b54fcb5c2

    @nadeemvawda nadeemvawda mannequin closed this as completed Mar 6, 2012
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants