classification
Title: Patch for zip decryption speedup
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: brian.curtin, christian.heimes, pitrou, r.david.murray, rhdv, serhiy.storchaka, shashank
Priority: normal Keywords: patch

Created on 2010-10-05 20:08 by shashank, last changed 2017-03-30 16:10 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
zipdecrypt.patch shashank, 2010-10-05 20:08
zipdecrypt.patch shashank, 2010-10-07 13:50 Updated patch with conditional use of C implementation
zipdecrypt.patch shashank, 2010-10-12 07:17 modified tests to test both C and Py impl
zipdecrypt.patch shashank, 2010-10-12 07:56 replacing tabs with spaces in the patch
zipdecrypt.patch shashank, 2010-10-23 21:22 updated patch
zipdecrypt.patch shashank, 2010-11-04 16:04 corrected patch
zipdecrypt-3.patch rhdv, 2012-11-03 21:27 Patch for python3 review
zipdecrypt-2.7.patch rhdv, 2012-11-03 21:28 Patch for python 2.7 review
doc.patch rhdv, 2012-11-04 13:20
zipfile_decryptor_speedup.patch serhiy.storchaka, 2012-11-04 16:00 Speedup of Python implementation review
Pull Requests
URL Status Linked Edit
PR 550 merged serhiy.storchaka, 2017-03-07 15:34
Messages (22)
msg118030 - (view) Author: Shashank (shashank) Date: 2010-10-05 20:08
As promised in this thread http://mail.python.org/pipermail/python-dev/2009-August/091450.html (a year ago!), attached is a patch that replaces simple zip decryption logic written in pure python with that in C.

As reported in the link above, this can result in speedups up to couple of orders of magnitude.

There doesn't seem to be any need to add any new tests as this patch doesn't change any public API
msg118039 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-05 22:35
It would be nice to retain the pure python version as a fallback for non CPython implementations, that will require tweaking the tests to make sure both are tested.
msg118110 - (view) Author: Shashank (shashank) Date: 2010-10-07 13:50
I have updated the patch with a check for the availability of C impl and to use pure-py impl as a fallback.

How do you suggest would the tests change? As I had mentioned before, in my understanding since there is no change in the API the already existing tests should work.

One can simulate the absence of C impl in a system where it is present but AFAIU this is not what it is usually done (e.g, in the case of optional zlib dependency in the same module)
msg118112 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-07 15:24
It is what is normally done *now* when there is both a C and a python implementation (see, for example, test_datetime.py and test_io.py for two different approaches to that).  Not all tests have been updated to this practice.

Thanks for working on this.
msg118405 - (view) Author: Shashank (shashank) Date: 2010-10-12 07:17
Attached is a patch with changes in Lib/test/test_zipfile.py to test both C and pure-py impls (on systems where the C impl is present).

Admittedly, this approach to emulating the absence of C impl is a bit hacky. This is primarily because the changed class is not a part of the public API and hence not being tested directly.

David, could you verify that the approach is ok?
msg118420 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-12 12:34
Hello,

Some quick comments:
- the C module should be private and therefore called _zipdecrypt
- if you want to avoid API mismatch, you could give a tp_call to your C decrypter object, rather than a "decrypt" method
- you can put all initialization code in zipdecrypt_new and avoid the need for zipdecrypt_init
- it's better to use the "y*" code in PyArg_ParseTuple, rather than "s#"
- you should define your module as PY_SSIZE_T_CLEAN and use Py_ssize_t as length variables (rather than int)
- you *mustn't* change the contents of the buffer which is given you by "s#" or "y*", since that buffer is read-only (it can be a bytes object); instead, create a new bytes object using PyBytes_FromStringAndSize(NULL, length) and write into that; or, if you want a read-write buffer, use the "w*" code
msg119476 - (view) Author: Shashank (shashank) Date: 2010-10-23 21:22
>the C module should be private and therefore called _zipdecrypt
done

>if you want to avoid API mismatch, you could give a tp_call to your C >decrypter object, rather than a "decrypt" method
done

>- you can put all initialization code in zipdecrypt_new and avoid the >need for zipdecrypt_init
keeping this similar to the existing _ZipDecrypter class in ZipFile (all initialization in init rather than new), which was probably to allow re-init and re-use of one instance

>it's better to use the "y*" code in PyArg_ParseTuple, rather than "s#"
y* does not seem to be available in 2.7, using s* instead

>you should define your module as PY_SSIZE_T_CLEAN and use Py_ssize_t >as length variables (rather than int)
done

>you *mustn't* change the contents of the buffer which is given you by >"s#" or "y*", since that buffer is read-only (it can be a bytes >object); instead, create a new bytes object using >PyBytes_FromStringAndSize(NULL, length) and write into that; or, if >you want a read-write buffer, use the "w*" code
corrected, not altering the input buffer, reading input buffer as s*
msg120417 - (view) Author: Shashank (shashank) Date: 2010-11-04 15:54
I had uploaded an incorrect patch. New corrected patch against trunk (on Mac OS uploaded).
msg174699 - (view) Author: Robert de Vries (rhdv) Date: 2012-11-03 21:27
Attached you will find the updated patch for the python 3 tree as of now.

I have measured a speed-up of more than a factor 100.
msg174700 - (view) Author: Robert de Vries (rhdv) Date: 2012-11-03 21:28
Patch for python 2.7

Same patch as for python 3, backported to python 2.7

Tested on Linux only.
msg174789 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-04 12:17
I quote from Gregory P. Smith (msg91897):

"""
The decryption provided by the zipfile module is for the worthless
32-bit crc based "encryption" of zipfiles.  I think promoting the use of
that is a bad idea.

zipfile can be used by people to get their data out of such files.  We
should not encourage them to put it and/or their code into such a stupid
format.
"""

I think that the effort required for speedup of this almost useless feature is excessive.

If someone want to implement the "strong encryption" for zip files - welcome.
msg174791 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-04 12:27
See also criticism in the original discussion: http://mail.python.org/pipermail/python-dev/2009-August/091450.html .
msg174793 - (view) Author: Robert de Vries (rhdv) Date: 2012-11-04 13:20
My use case is decrypting files of 100's of megabytes. This is so slow that it is quite useless. About an hour or so.

I do agree that the encryption is worthless, but that is not important for my use case where I want to discourage people from reverse engineering the contents.
If it is so dangerous as some people have pointed out, it should be removed at the cost of not supporting a standard feature of ZIP files.
In my opinion you either support a feature and you support it good (efficient) or you don't. As it stands now, users will be disappointed in using a supported feature.

Some people argue that adding C code to Python is dangerous as it will lead to bugs, vulnerabilities etc.
You could dismiss every addition with C code to Python with this argument, so there must be some positive aspects to outweigh the negative side. The negative side is fairly small (50 lines of very simple C code), plus some standard Python glue code.
The benefit is a 100 fold increase in performance and the removal of 1 line of documentation telling that this feature is extremely slow.
(patch attached)
msg174794 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-11-04 13:43
We aren't particularly interested in helping people make their files slightly harder to reverse engineer, either, so I don't think that is a good enough reason for accepting this.  There might be other reasons that are good enough, but I don't think that is one of them.
msg174795 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-11-04 13:48
From the zlib FAQ:

38. How can I encrypt/decrypt zip files with zlib?

    zlib doesn't support encryption. The original PKZIP encryption is very weak
    and can be broken with freely available programs. To get strong encryption,
    use GnuPG, http://www.gnupg.org/ , which already includes zlib compression.
    For PKZIP compatible "encryption", look at http://www.info-zip.org/

I don't see the point of a weak and easily breakable encryption.
msg174803 - (view) Author: Robert de Vries (rhdv) Date: 2012-11-04 14:28
If the encryption is so horrible why is there any support (with bad performance) at all in Python?
It would be better to remove it altogether.
This prevents users from building software using this feature only to find out later how bad the performance is. (This is the main reason why I have submitted this patch.)
If the support had not been in Python I would not have used this feature to begin with.

To reiterate my previous point. Either support something and do it well, or don't. Don't do a half job.

Please note that there are more attempts to fix this, so I am not completely alone here.
http://pypi.python.org/pypi/czipfile
msg174806 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-11-04 14:45
> If the encryption is so horrible why is there any support (with bad
> performance) at all in Python?
> It would be better to remove it altogether.

We don't remove it as it would break existing programs which rely on
this feature. However adding a bunch of C code to improve performance of
such a questionable feature is controversial.

I wouldn't be against the acceleration myself, however I am not
interested in reviewing or accepting it, sorry.
msg174816 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-04 16:00
Here is a patch which optimize (speed up 2x) Python implementation of ZIP decryptor.  It is almost the maximum of what can be achieved without significant degradation of maintainability.

Of course, 2x is less then 100x, but it more portable and costs almost nothing.  If that's not enough, I suggest to use an external unzip. You can even run multiple unzips at a time for the parallel extraction of multiple files.

If in the future someone will implement the strong encryption for ZIP files, it is possible it will required a C accelerator module and it is possible there will be a place for PKWARE's traditional encryption.
msg174823 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-11-04 18:03
> If the encryption is so horrible why is there any support (with bad 
> performance) at all in Python?

I would say it there so that people can use python to "decrypt" an "encrypted" zip archive they have been sent that was generated by some other tool.  I would say this is not a particularly strong use case since there are other tools that can be used for this, but as Antoine said removing the "feature" could break existing working code, so we are unlikely to do it.
msg174830 - (view) Author: Robert de Vries (rhdv) Date: 2012-11-04 20:36
The current situation is now that the decryption is part of Python. It is well known to be computationally intensive and should therefore be implemented in C.

This patch provides that support.

The discussion if Python should support the decryption is behind us, as the support is already in.

The only discussion should be about if there are enough users wanting this performance improvement to add it.

From the download statistics of czipfile I would say that there are roughly 2500 users interested.
msg290497 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-25 20:54
Brian, you requested tests, but PR 550 doesn't add new API and doesn't change the behavior of public API. No new tests are needed.
msg290857 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-30 16:09
New changeset 06e522521c06671b4559eecf9e2a185c2d62c141 by Serhiy Storchaka in branch 'master':
bpo-10030: Sped up reading encrypted ZIP files by 2 times. (#550)
https://github.com/python/cpython/commit/06e522521c06671b4559eecf9e2a185c2d62c141
History
Date User Action Args
2017-03-30 16:10:19serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2017-03-30 16:09:11serhiy.storchakasetmessages: + msg290857
2017-03-25 20:54:01serhiy.storchakasetnosy: + brian.curtin

messages: + msg290497
versions: + Python 3.7, - Python 3.4
2017-03-07 15:34:15serhiy.storchakasetpull_requests: + pull_request452
2012-11-04 20:36:25rhdvsetmessages: + msg174830
2012-11-04 18:03:34r.david.murraysetmessages: + msg174823
2012-11-04 16:00:58serhiy.storchakasetfiles: + zipfile_decryptor_speedup.patch

messages: + msg174816
2012-11-04 14:45:54pitrousetmessages: + msg174806
2012-11-04 14:28:46rhdvsetmessages: + msg174803
2012-11-04 13:48:57christian.heimessetnosy: + christian.heimes
messages: + msg174795
2012-11-04 13:43:22r.david.murraysetmessages: + msg174794
2012-11-04 13:20:20rhdvsetfiles: + doc.patch

messages: + msg174793
2012-11-04 12:27:51serhiy.storchakasetmessages: + msg174791
2012-11-04 12:17:00serhiy.storchakasetmessages: + msg174789
2012-11-03 22:06:11pitrousetnosy: + serhiy.storchaka

versions: + Python 3.4, - Python 3.2
2012-11-03 21:28:33rhdvsetfiles: + zipdecrypt-2.7.patch

messages: + msg174700
2012-11-03 21:27:17rhdvsetfiles: + zipdecrypt-3.patch
nosy: + rhdv
messages: + msg174699

2010-11-04 16:04:00shashanksetfiles: + zipdecrypt.patch
2010-11-04 16:03:29shashanksetfiles: - zipdecrypt.patch
2010-11-04 15:54:18shashanksetfiles: + zipdecrypt.patch

messages: + msg120417
2010-10-23 21:22:10shashanksetfiles: + zipdecrypt.patch

messages: + msg119476
2010-10-12 12:34:12pitrousetversions: + Python 3.2, - Python 2.7
nosy: + pitrou

messages: + msg118420

stage: patch review
2010-10-12 07:56:36shashanksetfiles: + zipdecrypt.patch
2010-10-12 07:17:51shashanksetfiles: + zipdecrypt.patch

messages: + msg118405
2010-10-07 15:24:43r.david.murraysetmessages: + msg118112
2010-10-07 13:50:19shashanksetfiles: + zipdecrypt.patch

messages: + msg118110
2010-10-05 22:35:56r.david.murraysetnosy: + r.david.murray
messages: + msg118039
2010-10-05 20:08:50shashankcreate