This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Add a file_digest() function in hashlib
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: christian.heimes Nosy List: Aur.Saraf, christian.heimes, gregory.p.smith, miss-islington, python-dev, tarek
Priority: normal Keywords: patch

Created on 2021-09-09 10:03 by tarek, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 28252 open python-dev, 2021-09-09 10:04
PR 31928 open python-dev, 2022-03-16 02:11
PR 31930 merged christian.heimes, 2022-03-16 11:49
PR 32046 merged christian.heimes, 2022-03-22 13:06
Messages (14)
msg401457 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2021-09-09 10:03
I am proposing the addition of a very simple helper to return the hash of a file.
msg401461 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-09-09 10:48
Hey Tarek, long time no see!

* the _sha256 module is optional, can be disabled and is not available in some distributions.

* I also don't like to use sha256 has default. It's slow, even slower than sha512. Any default makes it also harder to upgrade to a better, more secure default in the future.

* like hmac.new() a file_digest() should accept PEP 452-compatible arguments and hash name as digstmod argument, not just a callable.

* a filename argument prevents users from passing in file-like objects like BytesIO.

* 4096 bytes chunk size is very conservative. The call overhead for read() and update() may dominate the performance of the function.

* The hex argument feels weird.

In a perfect world, the hash and hmac objects should get an "update_file" method. The OpenSSL-based hashes could even release the GIL and utilize OpenSSL's BIO layer to avoid any Python overhead.
msg401462 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2021-09-09 10:56
Hey Christian, I hope things are well for you!
Thanks for all the precious feedback, I'll rework the patch accordingly
msg415138 - (view) Author: Aur Saraf (Aur.Saraf) * Date: 2022-03-14 13:56
Tarek,

Are you still working on this? Would you like me to take over?

Aur
msg415139 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2022-03-14 14:00
@Aur, go for it, I started to implement it and got lost into the details for each backend..
msg415141 - (view) Author: Aur Saraf (Aur.Saraf) * Date: 2022-03-14 14:13
OK, I'll give it a go.
msg415320 - (view) Author: Aur Saraf (Aur.Saraf) * Date: 2022-03-16 02:14
PR contains a draft implementation, would appreciate some review before I implement the same interface on all builtin hashes as well as OpenSSL hashes.
msg415321 - (view) Author: Aur Saraf (Aur.Saraf) * Date: 2022-03-16 02:21
The rationale behind `from_raw_file()` and the special treatment of non-buffered IO is that there is no `read_buffer()` API or other clean way to say "I want to read just what's currently in the buffer so that from now on I could read directly from the file descriptor without harm".

If you want to read from a buffered file object, sure, just call `from_file()`. If you want to ensure you'll get full performance benefits, call `from_raw_file()`. If you pass an eligible file object to `from_file()` you'll get the benefits anyway, because why not.
msg415324 - (view) Author: Aur Saraf (Aur.Saraf) * Date: 2022-03-16 02:36
Forgot an important warning: this is the first time I write C code against the Python API, and I didn't thoroughly read the guide (or at all, to be honest). I think I did a good job, but please suspect my code of noob errors.

I'm especially not confident that it's OK to not do any special handling of signals. Can read() return 0 if it was interrupted by a signal? This will stop the hash calculation midway and behave as if it succeeded. Sounds suspiciously like something we don't want. Also, I probably should support signals because such a long operation is something the user definitely might want to interrupt?

May I have some guidance please? Would it be enough to copy the code from fileutils.c _Py_Read() and addi an outer loop so we can do many reads with the GIL released and still call PyErr_CheckSignals when needed with the GIL taken?
msg415326 - (view) Author: Aur Saraf (Aur.Saraf) * Date: 2022-03-16 02:58
Added an attempt to handle signals. I don't think it's working, because when I press Ctrl+C while hashing a long file, it only raises KeyboardInterrupt after waiting the amount of time it usually takes the C code to return, but maybe that's not a good test?
msg415336 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2022-03-16 11:49
Before we continue hacking on an implementation, let's discuss some API design.

- Multiple functions or multiple dispatch (name, fileobj) are confusing to users. Let's keep it simple and only implement one function that operates on file objects.

- The function should work with most binary file-like including open(..., "rb"), BytesIO, SocketIO. mmap.mmap() is not file-like enough. Anon mapping doesn't provide a fileno and the mmap object has no readinto().

- The function should accept either digest name, digest constructor, or a callable that returns a digest object. The latter makes it possible to reuse file_digest() with MAC constructs like HMAC.

- Don't do any I/O in C unless you are prepared to enter a world of pain and suffering. It's hard to get it right across platforms. For example your C code does not work for SocketIO on Windows.

- If we decide to implement an accelerator in C, then we don't have to bother with our fallback copies like _sha256module.c. They are slow and only used when OpenSSL is not available.
msg415356 - (view) Author: Aur Saraf (Aur.Saraf) * Date: 2022-03-16 18:51
I don't think HMAC of a file is a common enough use case to support, but I have absolutely no problem conceding this point, the cost of supporting it is very low.

I/O in C is a world of pain in general. In the specific case of `io.RawIOBase` objects (non-buffered binary files) to my understanding it's not _that_ terrible (am I right? Does my I/O code work as-is?). To my understanding, providing a fast path *just for this case* that calculates the hash without taking the GIL for every chunk would be very nice to have for many use cases.

Now, we could just be happy with `file_digest()` having an `if` for `isinstance(io.RawIOBase)` that chooses a fast code path silently. But since non-buffered binary files are so hard to tell apart from other types of file-like objects, as a user of this code I would like to have a way to say "I want the fast path, please raise if I accidentally passed the wrong things and got the regular path". We could have `file_digest('sha256', open(path, 'rb', buffered=0), ensure_fast_io=True)`, but I think for this use case `raw_file_digest('sha256', open(path, 'rb', buffered=0))` is cleaner.

In all other cases you just call `file_digest()`, probably get the Python I/O and not the C I/O, and are still happy to have that loop written for you by someone who knows what they're doing.

For the same reason I think the fast path should only support hash names and not constructors/functions/etc', which would complicate it because new-object-can-be-accessed-without-GIL wouldn't necessarily apply.

Does this make sense?
msg415754 - (view) Author: miss-islington (miss-islington) Date: 2022-03-22 09:37
New changeset 4f97d64c831c94660ceb01f34d51fa236ad968b0 by Christian Heimes in branch 'main':
bpo-45150: Add hashlib.file_digest() for efficient file hashing (GH-31930)
https://github.com/python/cpython/commit/4f97d64c831c94660ceb01f34d51fa236ad968b0
msg415793 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2022-03-22 15:41
New changeset e03db6d5be7cf2e6b7b55284985c404de98a9420 by Christian Heimes in branch 'main':
bpo-45150: Fix testing under FIPS mode (GH-32046)
https://github.com/python/cpython/commit/e03db6d5be7cf2e6b7b55284985c404de98a9420
History
Date User Action Args
2022-04-11 14:59:49adminsetgithub: 89313
2022-03-22 15:41:04christian.heimessetmessages: + msg415793
2022-03-22 13:06:13christian.heimessetpull_requests: + pull_request30136
2022-03-22 09:37:24miss-islingtonsetnosy: + miss-islington
messages: + msg415754
2022-03-16 18:51:44Aur.Sarafsetmessages: + msg415356
2022-03-16 11:49:42christian.heimessetpull_requests: + pull_request30021
2022-03-16 11:49:27christian.heimessetassignee: tarek -> christian.heimes
messages: + msg415336
2022-03-16 02:58:43Aur.Sarafsetmessages: + msg415326
2022-03-16 02:36:04Aur.Sarafsetmessages: + msg415324
2022-03-16 02:21:23Aur.Sarafsetmessages: + msg415321
2022-03-16 02:14:05Aur.Sarafsetmessages: + msg415320
2022-03-16 02:11:47python-devsetpull_requests: + pull_request30019
2022-03-14 14:13:50Aur.Sarafsetmessages: + msg415141
2022-03-14 14:00:05tareksetmessages: + msg415139
2022-03-14 13:56:59Aur.Sarafsetnosy: + Aur.Saraf
messages: + msg415138
2021-09-09 10:56:27tareksetmessages: + msg401462
2021-09-09 10:48:47christian.heimessetmessages: + msg401461
2021-09-09 10:16:36serhiy.storchakasetnosy: + gregory.p.smith, christian.heimes

type: enhancement
versions: + Python 3.11
2021-09-09 10:04:49python-devsetkeywords: + patch
nosy: + python-dev

pull_requests: + pull_request26673
stage: patch review
2021-09-09 10:03:17tarekcreate