This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: hashlib: add a method to hash the content of a file
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, jcea, palaviv, techtonik, vstinner
Priority: normal Keywords: patch

Created on 2013-03-16 10:11 by techtonik, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description Edit
17436.patch palaviv, 2016-04-01 13:13 review
17436-2.patch palaviv, 2016-04-02 14:34 review
Messages (12)
msg184301 - (view) Author: anatoly techtonik (techtonik) Date: 2013-03-16 10:11
http://docs.python.org/3/library/hashlib#hashlib.hash.update

hashlib is most useful for big chunks of data, which means that every time you hash a file you need to write a wrapper that reads it in chunks. It makes sense to allow hashlib.update to accept a file-like object to read from.
msg184302 - (view) Author: anatoly techtonik (techtonik) Date: 2013-03-16 10:36
Otherwise you need to repeat this code.

import hashlib

def filehash(filepath):
    blocksize = 64 * 1024  # read in 64 KiB chunks
    sha = hashlib.sha256()
    with open(filepath, 'rb') as fp:
        while True:
            data = fp.read(blocksize)
            if not data:  # end of file
                break
            sha.update(data)
    return sha.hexdigest()
msg184304 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-03-16 11:11
> It makes sense to allow hashlib.update accept file like object
> to read from.

Not update() directly, but I agree that a helper would be convenient.

Here is another proposal using an unbuffered file and readinto() with a bytearray. It should be faster, but I didn't benchmark it. I also wrote two functions, because sometimes you have a file object rather than a file path.

---
import hashlib, sys

def hash_readfile_obj(obj, fp, buffersize=64 * 1024):
    buffer = bytearray(buffersize)
    while True:
        size = fp.readinto(buffer)
        if not size:
            break
        if size == buffersize:
            obj.update(buffer)
        else:
            obj.update(buffer[:size])

def hash_readfile(obj, filepath, buffersize=64 * 1024):
    with open(filepath, 'rb', buffering=0) as fp:
        hash_readfile_obj(obj, fp, buffersize)

def file_sha256(filepath):
    sha = hashlib.sha256()
    hash_readfile(sha, filepath)
    return sha.hexdigest()

for name in sys.argv[1:]:
    print("%s %s" % (file_sha256(name), name))
---

readfile() and readfile_obj() should be methods of a hash object.
msg184320 - (view) Author: anatoly techtonik (techtonik) Date: 2013-03-16 15:11
Even though I mentioned passing a file object in the title of this bug report, what I really need is the following API:

  hexhash = hashlib.sha256().readfile(filename).hexdigest()
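
Note that for this one-liner to work, readfile() would have to return the hash object itself. A minimal pure-Python sketch of that shape, using a hypothetical module-level helper rather than a real hashlib method:

import hashlib

def readfile(hashobj, filepath, blocksize=64 * 1024):
    # hypothetical helper: update hashobj from a file, then return it
    # so calls can be chained
    with open(filepath, 'rb') as fp:
        while True:
            data = fp.read(blocksize)
            if not data:
                break
            hashobj.update(data)
    return hashobj

# roughly equivalent to the proposed one-liner:
# hexhash = readfile(hashlib.sha256(), filename).hexdigest()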
msg184321 - (view) Author: anatoly techtonik (techtonik) Date: 2013-03-16 15:12
Why would unbuffered be faster?
msg184337 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-03-16 17:32
> Why would unbuffered be faster?

Well, I'm not sure that it is faster, but I would prefer to avoid buffering if it is not needed.

msg184339 - (view) Author: anatoly techtonik (techtonik) Date: 2013-03-16 17:56
I don't get that. I thought that buffered reading should be faster, although I agree that the OS should handle this better. Why is buffering turned on by default then? (I miss the ability to fork discussions from the tracker, but there is no choice.)
msg184377 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-03-17 14:34
> obj.update(buffer[:size])

This code does a useless memory copy: obj.update(memoryview(buffer)[:size]) can be used instead.
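
A minimal sketch of the loop with that fix applied (same hypothetical hash_readfile_obj helper as in msg184304; slicing the memoryview shares the buffer instead of copying it):

def hash_readfile_obj(obj, fp, buffersize=64 * 1024):
    buffer = bytearray(buffersize)
    view = memoryview(buffer)
    while True:
        size = fp.readinto(buffer)
        if not size:
            break
        # view[:size] is a zero-copy slice of the underlying bytearray
        obj.update(view[:size])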
msg262738 - (view) Author: Aviv Palivoda (palaviv) * Date: 2016-04-01 13:13
While working on issue 26488 I found a real need for this feature.
I added a new method to the hash object named fromfile(). The method updates the hash object with the content of the file-like object it receives.

I only added the feature to the hash algorithms provided by OpenSSL. If the reviews are positive I will do the work of adding this to all hash algorithms.
msg262739 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-01 13:23
> I added a new method to the hash object named fromfile().

Usually, fromxxx() is used to create a new object. In your case, it's more about updating an existing hash object. So I would prefer the name "readfile".

IMHO you need two methods:

* hashobj.readfile(filename: str)
* hashobj.readfileobj(file) where file is an object with a read() method which returns byte strings

The implementation of the two methods can be very different. In readfile(), you know that it's a regular file which exists on the file system. So you can directly use _Py_fstat() to get st_blksize and then loop on _Py_read().

For readfileobj(), the file object doesn't need to exist on disk; fileno() can raise an exception or not exist at all.

I suggest looking at the copyfile() and copyfileobj() functions of the shutil module. For example, copyfileobj() has an optional parameter for the buffer size. You should probably use that to avoid a complex heuristic to guess the optimal buffer size.
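
A rough pure-Python sketch of how the two proposed methods could relate, written here as hypothetical module-level helpers (the actual patch would implement them in C on the hash object):

def hash_readfileobj(hashobj, fileobj, length=64 * 1024):
    # works with any object exposing read(); mirrors shutil.copyfileobj's
    # optional buffer-size parameter
    while True:
        data = fileobj.read(length)
        if not data:
            break
        hashobj.update(data)

def hash_readfile(hashobj, filename):
    # regular file on disk; a C implementation could use fstat() to pick
    # st_blksize as the read size instead of a fixed one
    with open(filename, 'rb', buffering=0) as fp:
        hash_readfileobj(hashobj, fp)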
msg262740 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2016-04-01 13:27
For readfile() it might make more sense to implement it directly in C and let OpenSSL's BIO layer handle IO internally. It's more efficient and you can release the GIL around the whole operation.
msg262796 - (view) Author: Aviv Palivoda (palaviv) * Date: 2016-04-02 14:34
> * hashobj.readfile(filename: str)
> * hashobj.readfileobj(file) where file is an object with a read() method which returns byte strings

I changed the API to the one Victor suggested.

> For readfile() it might make more sense to implement it directly in C and let OpenSSL's BIO layer handle IO internally. It's more efficient and you can release the GIL around the whole operation.

The readfile method uses the OpenSSL BIO and releases the GIL around the whole operation.

> I suggest looking at the copyfile() and copyfileobj() functions of the shutil module. For example, copyfileobj() has an optional parameter for the buffer size. You should probably use that to avoid a complex heuristic to guess the optimal buffer size.

Added an optional block_size argument to readfileobj().

> In readfile(), you know that it's a regular file which exists on the file system. So you can directly use _Py_fstat() to get st_blksize

Currently using a constant block size in readfile(). From the discussion in issue 26488 I am not sure whether this should be changed.
History
Date User Action Args
2022-04-11 14:57:42 admin set github: 61638
2016-04-02 14:34:40 palaviv set files: + 17436-2.patch
messages: + msg262796
2016-04-01 13:27:36 christian.heimes set nosy: + christian.heimes
messages: + msg262740
2016-04-01 13:23:23 vstinner set messages: + msg262739
2016-04-01 13:13:23 palaviv set files: + 17436.patch
versions: + Python 3.6, - Python 3.4
nosy: + palaviv
messages: + msg262738
keywords: + patch
2013-03-18 21:19:55 vstinner set title: pass a file object to hashlib.update -> hashlib: add a method to hash the content of a file
versions: + Python 3.4, - Python 3.5
2013-03-17 14:34:05 vstinner set messages: + msg184377
2013-03-16 17:56:08 techtonik set messages: + msg184339
2013-03-16 17:32:31 vstinner set messages: + msg184337
2013-03-16 15:12:27 techtonik set messages: + msg184321
2013-03-16 15:11:08 techtonik set messages: + msg184320
2013-03-16 14:36:30 jcea set nosy: + jcea
2013-03-16 11:11:24 vstinner set nosy: + vstinner
messages: + msg184304
2013-03-16 10:36:54 techtonik set messages: + msg184302
2013-03-16 10:35:24 techtonik set title: pass a string to hashlib.update -> pass a file object to hashlib.update
2013-03-16 10:11:12 techtonik create