classification
Title: Supporting lzma compression in zip files
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: alanmcintyre, eric.araujo, loewis, nadeem.vawda, pitrou, python-dev, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2012-03-18 23:10 by serhiy.storchaka, last changed 2012-05-13 10:12 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
bzip2_and_lzma_in_zip.patch serhiy.storchaka, 2012-03-18 23:10 review
sample-bzip2.zip serhiy.storchaka, 2012-03-18 23:31 Sample zip file with bzip2 compression. Can be unpacked by Info-Unzip, 7-Zip, WinZip, etc.
sample-lzma.zip serhiy.storchaka, 2012-03-18 23:33 Sample zip file with lzma compression. Can be unpacked by 7-Zip, WinZip, etc. Future Info-Unzip 6.1 will unpack it.
bzip2_and_lzma_in_zip_3.patch serhiy.storchaka, 2012-03-23 19:22 Unified patch for bzip2 and lzma compression review
lzma_in_zip_2.patch serhiy.storchaka, 2012-05-07 19:34 review
lzma_in_zip_3.patch serhiy.storchaka, 2012-05-10 20:47 review
Messages (22)
msg156288 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-18 23:10
ZIP files specification supports new compression algorithms since 2006. Since bzip2 and lzma now contained in Python standart library, it would be nice to add support for these methods in zipfile. This will allow to process more foreign zip files and create more compact distributives.

The proposed patch adds two new methods ZIP_BZIP2 and ZIP_LZMA, which are automatically detecting when unpacking and that can be used for packing.
msg156291 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-18 23:26
This is not completed patch yet, without tests and documentation. I would like to receive feedback before the end of polishing. It might be worth transfer a portion of code in _lzmamodule.c for better use of capacity of LZMA API (used in zip format somewhat differs from LZMA_ALONE).
msg156295 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-19 00:21
ISTM that the LZMA support differs from the specification, see

http://www.pkware.com/documents/casestudies/APPNOTE.TXT

In particular, there appears to be no support for the EOS marker, which should be emitted when compressing.

Changing the LZMA module is fine as long as it
a) happens before the release of 3.3, and
b) is truly justified by the ZIP spec

I also recommend to split this issue into two: bzip support and lzma support. Adding bzip support might be easier.
msg156297 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-19 00:32
I also think that create_version and extract_version need to be adjusted. 

Since LZMA is version 6.3, we need to check for any features that might be in a zip file of extract version 6.3 or lower that we do not support (such as PPMd+ compression, strong encryption, etc.). In general, if we claim to support version x.y, we need to recognize that a feature is used that is supported for x1.y1 (x1.y1 <= x.y) even if we don't support the feature.
msg156340 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-19 14:13
Thank you, Martin, for review and advices.

Lzma in zip format: 2-bytes version (LZMA SDK version, it has not relations with version of XZ Utils used by lzma module), 2-bytes properties size (I have not seen a value other than 5), N-bytes (N=5) property data, and raw compressed data (LZMA_RAW).

Lzma file format (LZMA_ALONE): 5-bytes property data, 8-bytes uncompressed size (~0 if unknown), and raw compressed data (LZMA_RAW).

7-Zip ignores version and supports only 5-bytes property data. Because the LZMA1 codec is declared obsolete, it is highly unlikely for new versions with properties size != 5. Nevertheless, it would be wise to create a lzma module functions for parsing the bytes to the codec properties and for dumping the codec properties to the bytes (this is functions lzma_lzma_props_encode() and lzma_lzma_props_decode() in liblzma). It is not necessary but desirable. I see no other reasonable choice but to hardcode some arbitrary version in the compressing and to ignore it in the decompressing.

This EOS marker is only helpful for stream zip-files when the size of the compressed data is not known beforehand and it is not possible to specify the following (see lzma-file-format.txt in liblzma docs). But that's must be another issue, the current implementation of the zipfile module does not work with non-seekable files (I hope to work on it later).

> I also recommend to split this issue into two: bzip support and lzma support.

Assuredly. I will create a new issue for bzip2, but what do I do with lzma? Do I need to rename this issue or create a new one? Does the lzma patch include the bzip2 patch, because the latter will contain the code necessary to support all codecs? Or should defer any work with lzma until the bzip2 support will commited?

I think we should add the ability to register new codecs. Support for PPMd, jpeg and WavPack is unlikely to emerge in the Python in the foreseeable future, but users of third-party libraries (such as PIL), will use the new codecs as needed.

> I also think that create_version and extract_version need to be adjusted.

Agree. Should we raise an exception when using new compressor if allowZip64 == False? Or set allowZip64 = True, if we explicitly use the new compressor?
msg156370 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-20 00:12
For EOS, please re-read the specification. If you then still think it is not needed, read it again :-) The documentation in liblzma is irrelevant, only the PKWARE specification matters. Take particular notice of the phrase "implementers should include the EOS marker whenever possible"

For bzip: propose a patch that does just the bzip stuff, and any infrastructure changes needed for it. Having the LZMA patch depend on this is fine.

Re: extensible compressors. I don't think that's needed. There is only a finite set, and if somebody wants to support some compression method, they should submit a patch.

Re: allowZip64. This depends on whether you create or extract. Not using a feature on creation is fine - we don't *have* to use all supported features. On extraction, if a feature is used and we support it, it should get used regardless of any configuration (note: I didn't check what allowZip64 currently does).

Re: 7zip. What it does is irrelevant. The ZIP format is defined by PKWARE, so if you want to look at a reference implementation, use theirs. Else use the spec.
msg156395 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-20 11:33
Issue #14371: Add support for bzip2 compression to the zipfile module.
msg156644 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-03-23 10:35
I plan on doing a review of the patch, but it might be a week or two
before I have time to do it.

Regarding changes to lzma; exactly what is being proposed? If it's just
additional functions for encoding and decoding of filter specs, then I'm
fine with that (and it's not *necessary* to get it into 3.3, though it
would still be nice).
msg156674 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-23 19:31
> For EOS, please re-read the specification.

Well, nothing prevents the setting of this bit. Lzma raw compressor already appends EOS.
msg156676 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-23 19:48
> Regarding changes to lzma; exactly what is being proposed?

Yes, it's just additional functions for encoding and decoding of filter specs. But this is not necessary, the current implementation uses own functions (need the support of 
only LZMA1 format, which is unlikely to change).
msg159745 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-01 08:33
Here is the patch which adds support for lzma in zipfile. Later I'll
move `lzma_props_encode` and `lzma_props_decode` to lzma module.
msg159746 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-01 08:37
Note that the supporting of bzip2 increases the time of testing
test_zipfile in 1.5x, and lzma -- 2x (total 3x). May be not all tests
are needed?
msg159806 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-02 16:04
I'm not sure where the test run time actually comes from. It's certainly desirable to reduce the test run time, as long as all test purposes are preserved. It's not necessary to do redundant tests, e.g. if some test isn't about compression, there is no point in running it with all compressors (again, I'm not sure whether this actually happens).
msg159810 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-02 16:56
> Note that the supporting of bzip2 increases the time of testing
> test_zipfile in 1.5x, and lzma -- 2x (total 3x). May be not all tests
> are needed?

I think it's ok. Some of our tests run quite longer than that; and better to be exhaustive than fast (even though exhaustive *and* fast is better, of course).
msg160052 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-05-06 01:11
I've put together a patch for the lzma module adding functions for
encoding and decoding filter properties (see issue 14736). It's a bit
bulky, though, so I'd like to get a review before committing it.
msg160109 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-05-06 21:13
I've committed my patch as changeset 9118ef2b651a, adding functions
encode_filter_properties and decode_filter_properties to the lzma module.
msg160163 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-07 19:34
The patch updated to use functions encode_filter_properties and decode_filter_properties from the lzma module. Thank you, Nadeem Vawda.
msg160341 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-10 12:52
Serhiy, just to be sure we communicated well: I'm not asking that all the missing features for a 6.3 level zip library must be implemented. Instead, we need to detect whether an unsupported feature is used, and raise an exception reporting what the feature is that is unsupported. This is necessary as we otherwise may return incorrect data.
msg160373 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-10 20:47
I understand you, Martin. The time it took me to read the specification
and understand where should be the checks in the module. The patch
updated. The list of compression methods extended (in the future it can
be used for detailed output in the printdir()), flag of strict
encryption is checked directly, encrypted or compressed central
directory just will not be found (BadZipFile will be raised).
msg160509 - (view) Author: Roundup Robot (python-dev) Date: 2012-05-13 08:06
New changeset fccdcd83708a by Martin v. Löwis in branch 'default':
Issue #14366: Support lzma compression in zip files.
http://hg.python.org/cpython/rev/fccdcd83708a
msg160510 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-13 08:07
Thanks for the patch!
msg160514 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-13 10:12
Thank you, Martin. Both of these patches basically your merit.

Maybe take a look also at the small patches to issue14315 and
issue10376?
History
Date User Action Args
2012-05-13 10:12:57serhiy.storchakasetmessages: + msg160514
2012-05-13 08:07:16loewissetstatus: open -> closed
resolution: fixed
messages: + msg160510
2012-05-13 08:06:45python-devsetnosy: + python-dev
messages: + msg160509
2012-05-10 20:47:25serhiy.storchakasetfiles: + lzma_in_zip_3.patch

messages: + msg160373
2012-05-10 12:52:27loewissetmessages: + msg160341
2012-05-07 19:38:00serhiy.storchakasetfiles: - lzma_in_zip.patch
2012-05-07 19:34:58serhiy.storchakasetfiles: + lzma_in_zip_2.patch

messages: + msg160163
2012-05-06 21:13:59nadeem.vawdasetmessages: + msg160109
2012-05-06 01:11:32nadeem.vawdasetmessages: + msg160052
2012-05-02 16:56:41pitrousetnosy: + pitrou
messages: + msg159810
2012-05-02 16:04:41loewissetmessages: + msg159806
2012-05-01 08:37:57serhiy.storchakasetmessages: + msg159746
2012-05-01 08:33:13serhiy.storchakasetfiles: + lzma_in_zip.patch

messages: + msg159745
2012-05-01 06:47:32loewissettitle: Supporting bzip2 and lzma compression in zip files -> Supporting lzma compression in zip files
2012-03-23 19:48:39serhiy.storchakasetmessages: + msg156676
2012-03-23 19:31:25serhiy.storchakasetmessages: + msg156674
2012-03-23 19:22:05serhiy.storchakasetfiles: + bzip2_and_lzma_in_zip_3.patch
2012-03-23 10:35:57nadeem.vawdasetmessages: + msg156644
2012-03-20 11:33:02serhiy.storchakasetmessages: + msg156395
2012-03-20 00:12:48loewissetmessages: + msg156370
2012-03-19 14:13:14serhiy.storchakasetmessages: + msg156340
2012-03-19 00:32:13loewissetmessages: + msg156297
2012-03-19 00:21:46loewissetnosy: + loewis
messages: + msg156295
2012-03-18 23:34:18eric.araujosetnosy: + eric.araujo

title: Suporting bzip2 and lzma compression in zip files -> Supporting bzip2 and lzma compression in zip files
2012-03-18 23:33:57serhiy.storchakasetfiles: + sample-lzma.zip
2012-03-18 23:31:48serhiy.storchakasetfiles: + sample-bzip2.zip
2012-03-18 23:26:09serhiy.storchakasetmessages: + msg156291
2012-03-18 23:13:43nadeem.vawdasetnosy: + alanmcintyre, nadeem.vawda
2012-03-18 23:10:35serhiy.storchakacreate