Add a file_digest() function in hashlib #89313

Closed
tarekziade mannequin opened this issue Sep 9, 2021 · 17 comments
Assignees
tiran

Labels
3.11 (only security fixes) · stdlib (Python modules in the Lib dir) · type-feature (A feature request or enhancement)

Comments

@tarekziade (Mannequin)

tarekziade mannequin commented Sep 9, 2021

BPO 45150
Nosy @gpshead, @tiran, @tarekziade, @SonOfLilit, @miss-islington
PRs
  • bpo-45150: add a simple file_digest helper #28252
  • bpo-45150: draft implementation only for sha224,sha256 #31928
  • bpo-45150: Add hashlib.file_digest() for efficient file hashing #31930
  • bpo-45150: Fix testing under FIPS mode (GH-32046) #32046
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/tiran'
    closed_at = None
    created_at = <Date 2021-09-09.10:03:17.355>
    labels = ['type-feature', 'library', '3.11']
    title = 'Add a file_digest() function in hashlib'
    updated_at = <Date 2022-03-22.15:41:04.324>
    user = 'https://github.com/tarekziade'

    bugs.python.org fields:

    activity = <Date 2022-03-22.15:41:04.324>
    actor = 'christian.heimes'
    assignee = 'christian.heimes'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2021-09-09.10:03:17.355>
    creator = 'tarek'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 45150
    keywords = ['patch']
    message_count = 14.0
    messages = ['401457', '401461', '401462', '415138', '415139', '415141', '415320', '415321', '415324', '415326', '415336', '415356', '415754', '415793']
    nosy_count = 6.0
    nosy_names = ['gregory.p.smith', 'christian.heimes', 'tarek', 'python-dev', 'Aur.Saraf', 'miss-islington']
    pr_nums = ['28252', '31928', '31930', '32046']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue45150'
    versions = ['Python 3.11']

@tarekziade (Mannequin Author)

    tarekziade mannequin commented Sep 9, 2021

    I am proposing the addition of a very simple helper to return the hash of a file.
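For context, here is a rough sketch of the kind of helper being proposed, reconstructed only from the review comments below (the exact code in PR #28252 may differ; the filename argument, sha256 default, 4096-byte chunk size, and hex flag are exactly the points questioned in the next comment):

```python
import hashlib

def file_digest(filename, algorithm="sha256", chunk_size=4096, hex=True):
    # Illustrative reconstruction only, not the code from PR #28252.
    h = hashlib.new(algorithm)
    with open(filename, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest() if hex else h.digest()
```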

    @tarekziade tarekziade mannequin self-assigned this Sep 9, 2021
    @tarekziade tarekziade mannequin added the stdlib Python modules in the Lib dir label Sep 9, 2021
    @serhiy-storchaka serhiy-storchaka added 3.11 only security fixes type-feature A feature request or enhancement labels Sep 9, 2021
@tiran (Member)

    tiran commented Sep 9, 2021

    Hey Tarek, long time no see!

• The _sha256 module is optional; it can be disabled and is not available in some distributions.

• I also don't like using sha256 as the default. It's slow, even slower than sha512. Any default also makes it harder to upgrade to a better, more secure default in the future.

• Like hmac.new(), a file_digest() should accept PEP 452-compatible arguments and a hash name as the digestmod argument, not just a callable (a sketch follows below).

• A filename argument prevents users from passing in file-like objects such as BytesIO.

• A 4096-byte chunk size is very conservative; the call overhead of read() and update() may dominate the function's performance.

• The hex argument feels weird.

    In a perfect world, the hash and hmac objects should get an "update_file" method. The OpenSSL-based hashes could even release the GIL and utilize OpenSSL's BIO layer to avoid any Python overhead.
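A minimal sketch of the hmac.new()-style digestmod handling suggested above (the helper name is hypothetical, and module objects, which PEP 452 also allows, are omitted for brevity):

```python
import hashlib

def _resolve_digestmod(digestmod):
    # Hypothetical helper: accept a hash name, a hash constructor, or any
    # callable returning a hash-like object, the way hmac.new() does.
    if isinstance(digestmod, str):
        return hashlib.new(digestmod)
    if callable(digestmod):
        return digestmod()
    raise TypeError(f"unsupported digestmod: {digestmod!r}")
```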

@tarekziade (Mannequin Author)

    tarekziade mannequin commented Sep 9, 2021

    Hey Christian, I hope things are well for you!
Thanks for all the valuable feedback; I'll rework the patch accordingly.

@SonOfLilit (Mannequin)

    SonOfLilit mannequin commented Mar 14, 2022

    Tarek,

    Are you still working on this? Would you like me to take over?

    Aur

@tarekziade (Mannequin Author)

    tarekziade mannequin commented Mar 14, 2022

@aur, go for it. I started to implement it and got lost in the details of each backend.

@SonOfLilit (Mannequin)

    SonOfLilit mannequin commented Mar 14, 2022

    OK, I'll give it a go.

@SonOfLilit (Mannequin)

    SonOfLilit mannequin commented Mar 16, 2022

The PR contains a draft implementation; I would appreciate some review before I implement the same interface for all built-in hashes as well as the OpenSSL hashes.

@SonOfLilit (Mannequin)

    SonOfLilit mannequin commented Mar 16, 2022

The rationale behind from_raw_file() and the special treatment of non-buffered I/O is that there is no read_buffer() API or other clean way to say "I want to read just what's currently in the buffer, so that from then on I can read directly from the file descriptor without harm".

    If you want to read from a buffered file object, sure, just call from_file(). If you want to ensure you'll get full performance benefits, call from_raw_file(). If you pass an eligible file object to from_file() you'll get the benefits anyway, because why not.
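For reference, the buffered/raw distinction being drawn here; the from_file()/from_raw_file() names come from the draft PR and were not merged, and "data.bin" is just an example path:

```python
import io

buffered = open("data.bin", "rb")          # io.BufferedReader wrapping a raw file
raw = open("data.bin", "rb", buffering=0)  # io.FileIO, a subclass of io.RawIOBase

print(isinstance(buffered, io.RawIOBase))  # False: reads go through a Python-level buffer
print(isinstance(raw, io.RawIOBase))       # True: reads map directly onto the file descriptor

buffered.close()
raw.close()
```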

@SonOfLilit (Mannequin)

    SonOfLilit mannequin commented Mar 16, 2022

Forgot an important warning: this is the first time I have written C code against the Python API, and I didn't thoroughly read the guide (or at all, to be honest). I think I did a good job, but please suspect my code of noob errors.

I'm especially not confident that it's OK not to do any special handling of signals. Can read() return 0 if it was interrupted by a signal? That would stop the hash calculation midway yet behave as if it succeeded, which sounds suspiciously like something we don't want. Also, I probably should handle signals, because such a long operation is something the user might well want to interrupt.

May I have some guidance, please? Would it be enough to copy the code from _Py_Read() in fileutils.c and add an outer loop so we can do many reads with the GIL released and still call PyErr_CheckSignals() when needed with the GIL held?

@SonOfLilit (Mannequin)

    SonOfLilit mannequin commented Mar 16, 2022

Added an attempt to handle signals. I don't think it's working: when I press Ctrl+C while hashing a long file, KeyboardInterrupt is only raised after roughly the time it usually takes the C code to return. But maybe that's not a good test?

@tiran (Member)

    tiran commented Mar 16, 2022

    Before we continue hacking on an implementation, let's discuss some API design.

• Multiple functions or multiple dispatch (name, fileobj) are confusing to users. Let's keep it simple and only implement one function that operates on file objects (a sketch follows after this list).

• The function should work with most binary file-like objects, including open(..., "rb"), BytesIO, and SocketIO. mmap.mmap() is not file-like enough: an anonymous mapping doesn't provide a fileno and the mmap object has no readinto().

• The function should accept either a digest name, a digest constructor, or a callable that returns a digest object. The latter makes it possible to reuse file_digest() with MAC constructs like HMAC.

• Don't do any I/O in C unless you are prepared to enter a world of pain and suffering. It's hard to get right across platforms; for example, your C code does not work for SocketIO on Windows.

• If we decide to implement an accelerator in C, then we don't have to bother with our fallback copies like _sha256module.c. They are slow and only used when OpenSSL is not available.
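A minimal pure-Python sketch of the single-function, file-object-based design outlined in this list (the buffer size, keyword name, and the name-or-callable check are assumptions, not the merged implementation):

```python
import hashlib

def file_digest(fileobj, digest, *, bufsize=262144):
    # One function operating on a binary file-like object; digest may be a
    # hash name, a hash constructor, or a callable returning a digest object
    # (the callable form also covers HMAC, e.g. via functools.partial).
    h = hashlib.new(digest) if isinstance(digest, str) else digest()
    buf = bytearray(bufsize)
    view = memoryview(buf)
    while True:
        n = fileobj.readinto(buf)
        if not n:
            break
        h.update(view[:n])
    return h
```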

    @tiran tiran assigned tiran and unassigned tarekziade Mar 16, 2022
@SonOfLilit (Mannequin)

    SonOfLilit mannequin commented Mar 16, 2022

I don't think HMAC of a file is a common enough use case to support, but I have absolutely no problem conceding this point; the cost of supporting it is very low.

I/O in C is a world of pain in general. In the specific case of io.RawIOBase objects (non-buffered binary files), my understanding is that it's not that terrible (am I right? does my I/O code work as-is?). Providing a fast path just for this case, one that calculates the hash without taking the GIL for every chunk, would be very nice to have for many use cases.

Now, we could just be happy with file_digest() having an isinstance(io.RawIOBase) check that chooses a fast code path silently. But since non-buffered binary files are so hard to tell apart from other types of file-like objects, as a user of this code I would like a way to say "I want the fast path; please raise if I accidentally passed the wrong thing and got the regular path". We could have file_digest('sha256', open(path, 'rb', buffering=0), ensure_fast_io=True), but I think raw_file_digest('sha256', open(path, 'rb', buffering=0)) is cleaner for this use case.

    In all other cases you just call file_digest(), probably get the Python I/O and not the C I/O, and are still happy to have that loop written for you by someone who knows what they're doing.

For the same reason, I think the fast path should only support hash names and not constructors, functions, etc., which would complicate it because the new object couldn't necessarily be used without the GIL.

    Does this make sense?

@miss-islington (Contributor)

    New changeset 4f97d64 by Christian Heimes in branch 'main':
    bpo-45150: Add hashlib.file_digest() for efficient file hashing (GH-31930)
    4f97d64
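With this change merged, hashlib.file_digest() is available from Python 3.11 on. Typical usage looks like the following (the file name is illustrative):

```python
import hashlib

# Requires Python 3.11+. The file must be opened in binary mode; the second
# argument may be a hash name or a callable that returns a hash object.
with open("archive.tar.gz", "rb") as f:
    digest = hashlib.file_digest(f, "sha256")
print(digest.hexdigest())
```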

@tiran (Member)

    tiran commented Mar 22, 2022

    New changeset e03db6d by Christian Heimes in branch 'main':
    bpo-45150: Fix testing under FIPS mode (GH-32046)
    e03db6d

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@pablogsal (Member)

@tiran Can you make a PR adding file_digest() to the 3.11 What's New, please?

    miss-islington pushed a commit that referenced this issue Aug 13, 2022
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue Aug 13, 2022
    @tiran tiran closed this as completed Aug 13, 2022
    miss-islington added a commit that referenced this issue Aug 13, 2022
@calestyo (Contributor)

    Hey folks.

I know this is closed and perhaps I should simply file a new request... but would you consider extending the interface of that function to (efficiently) calculate a file's hash for multiple algorithms (i.e. without reading the file once per algorithm)?

One could perhaps do so by accepting a list of digests and returning a dict mapping each algorithm name to its hash value.

    Or perhaps something smarter ^^

    Cheers,
    Chris.
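Such an interface was not added here, but a user-level helper along those lines is straightforward; a minimal sketch, where the function name, buffer size, and example path are assumptions:

```python
import hashlib

def multi_file_digest(fileobj, algorithms, *, bufsize=262144):
    # Read the binary file object once, feeding each chunk to one digest per
    # requested algorithm; return a dict mapping algorithm name -> digest object.
    digests = {name: hashlib.new(name) for name in algorithms}
    buf = bytearray(bufsize)
    view = memoryview(buf)
    while True:
        n = fileobj.readinto(buf)
        if not n:
            break
        for d in digests.values():
            d.update(view[:n])
    return digests

# Example:
# with open("archive.tar.gz", "rb") as f:
#     sums = multi_file_digest(f, ["sha256", "sha512"])
#     print({name: d.hexdigest() for name, d in sums.items()})
```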

@gpshead (Member)

    gpshead commented Jun 23, 2023

    Please file a new feature request issue here for that.

    Projects
    None yet
    Development

    No branches or pull requests

    6 participants