New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Patch for zip decryption speedup #54239
Comments
As promised in this thread http://mail.python.org/pipermail/python-dev/2009-August/091450.html (a year ago!), attached is a patch that replaces simple zip decryption logic written in pure python with that in C. As reported in the link above, this can result in speedups up to couple of orders of magnitude. There doesn't seem to be any need to add any new tests as this patch doesn't change any public API |
It would be nice to retain the pure python version as a fallback for non CPython implementations, that will require tweaking the tests to make sure both are tested. |
I have updated the patch with a check for the availability of C impl and to use pure-py impl as a fallback. How do you suggest would the tests change? As I had mentioned before, in my understanding since there is no change in the API the already existing tests should work. One can simulate the absence of C impl in a system where it is present but AFAIU this is not what it is usually done (e.g, in the case of optional zlib dependency in the same module) |
It is what is normally done *now* when there is both a C and a python implementation (see, for example, test_datetime.py and test_io.py for two different approaches to that). Not all tests have been updated to this practice. Thanks for working on this. |
Attached is a patch with changes in Lib/test/test_zipfile.py to test both C and pure-py impls (on systems where the C impl is present). Admittedly, this approach to emulating the absence of C impl is a bit hacky. This is primarily because the changed class is not a part of the public API and hence not being tested directly. David, could you verify that the approach is ok? |
Hello, Some quick comments:
|
|
I had uploaded an incorrect patch. New corrected patch against trunk (on Mac OS uploaded). |
Attached you will find the updated patch for the python 3 tree as of now. I have measured a speed-up of more than a factor 100. |
Patch for python 2.7 Same patch as for python 3, backported to python 2.7 Tested on Linux only. |
I quote from Gregory P. Smith (msg91897): """ zipfile can be used by people to get their data out of such files. We I think that the effort required for speedup of this almost useless feature is excessive. If someone want to implement the "strong encryption" for zip files - welcome. |
See also criticism in the original discussion: http://mail.python.org/pipermail/python-dev/2009-August/091450.html . |
My use case is decrypting files of 100's of megabytes. This is so slow that it is quite useless. About an hour or so. I do agree that the encryption is worthless, but that is not important for my use case where I want to discourage people from reverse engineering the contents. Some people argue that adding C code to Python is dangerous as it will lead to bugs, vulnerabilities etc. |
We aren't particularly interested in helping people make their files slightly harder to reverse engineer, either, so I don't think that is a good enough reason for accepting this. There might be other reasons that are good enough, but I don't think that is one of them. |
From the zlib FAQ:
I don't see the point of a weak and easily breakable encryption. |
If the encryption is so horrible why is there any support (with bad performance) at all in Python? To reiterate my previous point. Either support something and do it well, or don't. Don't do a half job. Please note that there are more attempts to fix this, so I am not completely alone here. |
We don't remove it as it would break existing programs which rely on I wouldn't be against the acceleration myself, however I am not |
Here is a patch which optimize (speed up 2x) Python implementation of ZIP decryptor. It is almost the maximum of what can be achieved without significant degradation of maintainability. Of course, 2x is less then 100x, but it more portable and costs almost nothing. If that's not enough, I suggest to use an external unzip. You can even run multiple unzips at a time for the parallel extraction of multiple files. If in the future someone will implement the strong encryption for ZIP files, it is possible it will required a C accelerator module and it is possible there will be a place for PKWARE's traditional encryption. |
I would say it there so that people can use python to "decrypt" an "encrypted" zip archive they have been sent that was generated by some other tool. I would say this is not a particularly strong use case since there are other tools that can be used for this, but as Antoine said removing the "feature" could break existing working code, so we are unlikely to do it. |
The current situation is now that the decryption is part of Python. It is well known to be computationally intensive and should therefore be implemented in C. This patch provides that support. The discussion if Python should support the decryption is behind us, as the support is already in. The only discussion should be about if there are enough users wanting this performance improvement to add it. From the download statistics of czipfile I would say that there are roughly 2500 users interested. |
Brian, you requested tests, but PR 550 doesn't add new API and doesn't change the behavior of public API. No new tests are needed. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: