Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch for zip decryption speedup #54239

Closed
shashank mannequin opened this issue Oct 5, 2010 · 22 comments
Closed

Patch for zip decryption speedup #54239

shashank mannequin opened this issue Oct 5, 2010 · 22 comments
Labels
3.7 (EOL) end of life performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@shashank
Copy link
Mannequin

shashank mannequin commented Oct 5, 2010

BPO 10030
Nosy @pitrou, @tiran, @bitdancer, @briancurtin, @serhiy-storchaka
PRs
  • bpo-10030: Sped up reading encrypted ZIP files by 2 times. #550
  • Files
  • zipdecrypt.patch
  • zipdecrypt.patch: Updated patch with conditional use of C implementation
  • zipdecrypt.patch: modified tests to test both C and Py impl
  • zipdecrypt.patch: replacing tabs with spaces in the patch
  • zipdecrypt.patch: updated patch
  • zipdecrypt.patch: corrected patch
  • zipdecrypt-3.patch: Patch for python3
  • zipdecrypt-2.7.patch: Patch for python 2.7
  • doc.patch
  • zipfile_decryptor_speedup.patch: Speedup of Python implementation
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-03-30.16:10:19.563>
    created_at = <Date 2010-10-05.20:08:50.510>
    labels = ['3.7', 'library', 'performance']
    title = 'Patch for zip decryption speedup'
    updated_at = <Date 2017-03-30.16:10:19.562>
    user = 'https://bugs.python.org/shashank'

    bugs.python.org fields:

    activity = <Date 2017-03-30.16:10:19.562>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-03-30.16:10:19.563>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2010-10-05.20:08:50.510>
    creator = 'shashank'
    dependencies = []
    files = ['19135', '19149', '19196', '19197', '19347', '19495', '27867', '27868', '27881', '27882']
    hgrepos = []
    issue_num = 10030
    keywords = ['patch']
    message_count = 22.0
    messages = ['118030', '118039', '118110', '118112', '118405', '118420', '119476', '120417', '174699', '174700', '174789', '174791', '174793', '174794', '174795', '174803', '174806', '174816', '174823', '174830', '290497', '290857']
    nosy_count = 7.0
    nosy_names = ['pitrou', 'christian.heimes', 'r.david.murray', 'brian.curtin', 'shashank', 'serhiy.storchaka', 'rhdv']
    pr_nums = ['550']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue10030'
    versions = ['Python 3.7']

    @shashank
    Copy link
    Mannequin Author

    shashank mannequin commented Oct 5, 2010

    As promised in this thread http://mail.python.org/pipermail/python-dev/2009-August/091450.html (a year ago!), attached is a patch that replaces simple zip decryption logic written in pure python with that in C.

    As reported in the link above, this can result in speedups up to couple of orders of magnitude.

    There doesn't seem to be any need to add any new tests as this patch doesn't change any public API

    @shashank shashank mannequin added stdlib Python modules in the Lib dir performance Performance or resource usage labels Oct 5, 2010
    @bitdancer
    Copy link
    Member

    It would be nice to retain the pure python version as a fallback for non CPython implementations, that will require tweaking the tests to make sure both are tested.

    @shashank
    Copy link
    Mannequin Author

    shashank mannequin commented Oct 7, 2010

    I have updated the patch with a check for the availability of C impl and to use pure-py impl as a fallback.

    How do you suggest would the tests change? As I had mentioned before, in my understanding since there is no change in the API the already existing tests should work.

    One can simulate the absence of C impl in a system where it is present but AFAIU this is not what it is usually done (e.g, in the case of optional zlib dependency in the same module)

    @bitdancer
    Copy link
    Member

    It is what is normally done *now* when there is both a C and a python implementation (see, for example, test_datetime.py and test_io.py for two different approaches to that). Not all tests have been updated to this practice.

    Thanks for working on this.

    @shashank
    Copy link
    Mannequin Author

    shashank mannequin commented Oct 12, 2010

    Attached is a patch with changes in Lib/test/test_zipfile.py to test both C and pure-py impls (on systems where the C impl is present).

    Admittedly, this approach to emulating the absence of C impl is a bit hacky. This is primarily because the changed class is not a part of the public API and hence not being tested directly.

    David, could you verify that the approach is ok?

    @pitrou
    Copy link
    Member

    pitrou commented Oct 12, 2010

    Hello,

    Some quick comments:

    • the C module should be private and therefore called _zipdecrypt
    • if you want to avoid API mismatch, you could give a tp_call to your C decrypter object, rather than a "decrypt" method
    • you can put all initialization code in zipdecrypt_new and avoid the need for zipdecrypt_init
    • it's better to use the "y*" code in PyArg_ParseTuple, rather than "s#"
    • you should define your module as PY_SSIZE_T_CLEAN and use Py_ssize_t as length variables (rather than int)
    • you *mustn't* change the contents of the buffer which is given you by "s#" or "y*", since that buffer is read-only (it can be a bytes object); instead, create a new bytes object using PyBytes_FromStringAndSize(NULL, length) and write into that; or, if you want a read-write buffer, use the "w*" code

    @shashank
    Copy link
    Mannequin Author

    shashank mannequin commented Oct 23, 2010

    the C module should be private and therefore called _zipdecrypt
    done

    if you want to avoid API mismatch, you could give a tp_call to your C >decrypter object, rather than a "decrypt" method
    done

    • you can put all initialization code in zipdecrypt_new and avoid the >need for zipdecrypt_init
      keeping this similar to the existing _ZipDecrypter class in ZipFile (all initialization in init rather than new), which was probably to allow re-init and re-use of one instance

    it's better to use the "y*" code in PyArg_ParseTuple, rather than "s#"
    y* does not seem to be available in 2.7, using s* instead

    you should define your module as PY_SSIZE_T_CLEAN and use Py_ssize_t >as length variables (rather than int)
    done

    you *mustn't* change the contents of the buffer which is given you by >"s#" or "y*", since that buffer is read-only (it can be a bytes >object); instead, create a new bytes object using >PyBytes_FromStringAndSize(NULL, length) and write into that; or, if >you want a read-write buffer, use the "w*" code
    corrected, not altering the input buffer, reading input buffer as s*

    @shashank
    Copy link
    Mannequin Author

    shashank mannequin commented Nov 4, 2010

    I had uploaded an incorrect patch. New corrected patch against trunk (on Mac OS uploaded).

    @rhdv
    Copy link
    Mannequin

    rhdv mannequin commented Nov 3, 2012

    Attached you will find the updated patch for the python 3 tree as of now.

    I have measured a speed-up of more than a factor 100.

    @rhdv
    Copy link
    Mannequin

    rhdv mannequin commented Nov 3, 2012

    Patch for python 2.7

    Same patch as for python 3, backported to python 2.7

    Tested on Linux only.

    @serhiy-storchaka
    Copy link
    Member

    I quote from Gregory P. Smith (msg91897):

    """
    The decryption provided by the zipfile module is for the worthless
    32-bit crc based "encryption" of zipfiles. I think promoting the use of
    that is a bad idea.

    zipfile can be used by people to get their data out of such files. We
    should not encourage them to put it and/or their code into such a stupid
    format.
    """

    I think that the effort required for speedup of this almost useless feature is excessive.

    If someone want to implement the "strong encryption" for zip files - welcome.

    @serhiy-storchaka
    Copy link
    Member

    See also criticism in the original discussion: http://mail.python.org/pipermail/python-dev/2009-August/091450.html .

    @rhdv
    Copy link
    Mannequin

    rhdv mannequin commented Nov 4, 2012

    My use case is decrypting files of 100's of megabytes. This is so slow that it is quite useless. About an hour or so.

    I do agree that the encryption is worthless, but that is not important for my use case where I want to discourage people from reverse engineering the contents.
    If it is so dangerous as some people have pointed out, it should be removed at the cost of not supporting a standard feature of ZIP files.
    In my opinion you either support a feature and you support it good (efficient) or you don't. As it stands now, users will be disappointed in using a supported feature.

    Some people argue that adding C code to Python is dangerous as it will lead to bugs, vulnerabilities etc.
    You could dismiss every addition with C code to Python with this argument, so there must be some positive aspects to outweigh the negative side. The negative side is fairly small (50 lines of very simple C code), plus some standard Python glue code.
    The benefit is a 100 fold increase in performance and the removal of 1 line of documentation telling that this feature is extremely slow.
    (patch attached)

    @bitdancer
    Copy link
    Member

    We aren't particularly interested in helping people make their files slightly harder to reverse engineer, either, so I don't think that is a good enough reason for accepting this. There might be other reasons that are good enough, but I don't think that is one of them.

    @tiran
    Copy link
    Member

    tiran commented Nov 4, 2012

    From the zlib FAQ:

    1. How can I encrypt/decrypt zip files with zlib?

      zlib doesn't support encryption. The original PKZIP encryption is very weak
      and can be broken with freely available programs. To get strong encryption,
      use GnuPG, http://www.gnupg.org/ , which already includes zlib compression.
      For PKZIP compatible "encryption", look at http://www.info-zip.org/

    I don't see the point of a weak and easily breakable encryption.

    @rhdv
    Copy link
    Mannequin

    rhdv mannequin commented Nov 4, 2012

    If the encryption is so horrible why is there any support (with bad performance) at all in Python?
    It would be better to remove it altogether.
    This prevents users from building software using this feature only to find out later how bad the performance is. (This is the main reason why I have submitted this patch.)
    If the support had not been in Python I would not have used this feature to begin with.

    To reiterate my previous point. Either support something and do it well, or don't. Don't do a half job.

    Please note that there are more attempts to fix this, so I am not completely alone here.
    http://pypi.python.org/pypi/czipfile

    @pitrou
    Copy link
    Member

    pitrou commented Nov 4, 2012

    If the encryption is so horrible why is there any support (with bad
    performance) at all in Python?
    It would be better to remove it altogether.

    We don't remove it as it would break existing programs which rely on
    this feature. However adding a bunch of C code to improve performance of
    such a questionable feature is controversial.

    I wouldn't be against the acceleration myself, however I am not
    interested in reviewing or accepting it, sorry.

    @serhiy-storchaka
    Copy link
    Member

    Here is a patch which optimize (speed up 2x) Python implementation of ZIP decryptor. It is almost the maximum of what can be achieved without significant degradation of maintainability.

    Of course, 2x is less then 100x, but it more portable and costs almost nothing. If that's not enough, I suggest to use an external unzip. You can even run multiple unzips at a time for the parallel extraction of multiple files.

    If in the future someone will implement the strong encryption for ZIP files, it is possible it will required a C accelerator module and it is possible there will be a place for PKWARE's traditional encryption.

    @bitdancer
    Copy link
    Member

    If the encryption is so horrible why is there any support (with bad
    performance) at all in Python?

    I would say it there so that people can use python to "decrypt" an "encrypted" zip archive they have been sent that was generated by some other tool. I would say this is not a particularly strong use case since there are other tools that can be used for this, but as Antoine said removing the "feature" could break existing working code, so we are unlikely to do it.

    @rhdv
    Copy link
    Mannequin

    rhdv mannequin commented Nov 4, 2012

    The current situation is now that the decryption is part of Python. It is well known to be computationally intensive and should therefore be implemented in C.

    This patch provides that support.

    The discussion if Python should support the decryption is behind us, as the support is already in.

    The only discussion should be about if there are enough users wanting this performance improvement to add it.

    From the download statistics of czipfile I would say that there are roughly 2500 users interested.

    @serhiy-storchaka
    Copy link
    Member

    Brian, you requested tests, but PR 550 doesn't add new API and doesn't change the behavior of public API. No new tests are needed.

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Mar 25, 2017
    @serhiy-storchaka
    Copy link
    Member

    New changeset 06e5225 by Serhiy Storchaka in branch 'master':
    bpo-10030: Sped up reading encrypted ZIP files by 2 times. (#550)
    06e5225

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants