This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: hashlib: add a method to hash the content of a file
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, jcea, palaviv, techtonik, vstinner
Priority: normal Keywords: patch

Created on 2013-03-16 10:11 by techtonik, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description Edit
17436.patch palaviv, 2016-04-01 13:13 review
17436-2.patch palaviv, 2016-04-02 14:34 review
Messages (12)
msg184301 - (view) Author: anatoly techtonik (techtonik) Date: 2013-03-16 10:11
http://docs.python.org/3/library/hashlib#hashlib.hash.update

hashlib is most useful for big chunks of data, which means that every time you hash a file you need to write a wrapper that reads it in chunks. It makes sense to allow hashlib.update to accept a file-like object to read from.
msg184302 - (view) Author: anatoly techtonik (techtonik) Date: 2013-03-16 10:36
Otherwise you need to repeat this code.

import hashlib

def filehash(filepath):
    blocksize = 64 * 1024  # read in 64 KiB chunks
    sha = hashlib.sha256()
    with open(filepath, 'rb') as fp:
        while True:
            data = fp.read(blocksize)
            if not data:  # end of file
                break
            sha.update(data)
    return sha.hexdigest()
msg184304 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-03-16 11:11
> It makes sense to allow hashlib.update accept file like object
> to read from.

Not update() directly, but I agree that a helper would be convenient.

Here is another proposal using an unbuffered file and readinto() with a bytearray. It should be faster, but I didn't benchmark it. I also wrote two functions, because sometimes you have a file object rather than a file path.

---
import hashlib, sys

def hash_readfile_obj(obj, fp, buffersize=64 * 1024):
    buffer = bytearray(buffersize)
    while True:
        size = fp.readinto(buffer)
        if not size:
            break
        if size == buffersize:
            obj.update(buffer)
        else:
            obj.update(buffer[:size])

def hash_readfile(obj, filepath, buffersize=64 * 1024):
    with open(filepath, 'rb', buffering=0) as fp:
        hash_readfile_obj(obj, fp, buffersize)

def file_sha256(filepath):
    sha = hashlib.sha256()
    hash_readfile(sha, filepath)
    return sha.hexdigest()

for name in sys.argv[1:]:
    print("%s %s" % (file_sha256(name), name))
---

readfile() and readfile_obj() should be methods of a hash object.
msg184320 - (view) Author: anatoly techtonik (techtonik) Date: 2013-03-16 15:11
Even though I mentioned passing a file object in the title of this bug report, what I really need is the following API:

  hexhash = hashlib.sha256().readfile(filename).hexdigest()
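
Note that for this one-liner to work, readfile() would have to return the hash object itself. A minimal pure-Python sketch of that shape, using a hypothetical module-level helper rather than a real hashlib method:

import hashlib

def readfile(hashobj, filepath, blocksize=64 * 1024):
    # hypothetical helper: update hashobj from a file, then return it
    # so calls can be chained
    with open(filepath, 'rb') as fp:
        while True:
            data = fp.read(blocksize)
            if not data:
                break
            hashobj.update(data)
    return hashobj

# roughly equivalent to the proposed one-liner:
# hexhash = readfile(hashlib.sha256(), filename).hexdigest()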
msg184321 - (view) Author: anatoly techtonik (techtonik) Date: 2013-03-16 15:12
Why would unbuffered be faster?
msg184337 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-03-16 17:32
> Why would unbuffered be faster?

Well, I'm not sure that it is faster, but I would prefer to avoid buffering if it is not needed.

msg184339 - (view) Author: anatoly techtonik (techtonik) Date: 2013-03-16 17:56
I don't get that. I thought that buffered reading should be faster, although I agree that the OS should handle this better. Why is buffering turned on by default then? (I miss the ability to fork discussions from the tracker, but there is no choice.)
msg184377 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-03-17 14:34
> obj.update(buffer[:size])

This code does a useless memory copy: obj.update(memoryview(buffer)[:size]) can be used instead.
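
A minimal sketch of the loop with that fix applied (same hypothetical hash_readfile_obj helper as in msg184304; slicing the memoryview shares the buffer instead of copying it):

def hash_readfile_obj(obj, fp, buffersize=64 * 1024):
    buffer = bytearray(buffersize)
    view = memoryview(buffer)
    while True:
        size = fp.readinto(buffer)
        if not size:
            break
        # view[:size] is a zero-copy slice of the underlying bytearray
        obj.update(view[:size])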
msg262738 - (view) Author: Aviv Palivoda (palaviv) * Date: 2016-04-01 13:13
While working on issue 26488 I found a real need for this feature.
I added a new method to the hash object named fromfile(). The method updates the hash object with the content of the file-like object it receives.

I only added the feature to the hash algorithms provided by OpenSSL. If the reviews are positive I will do the work of adding this to all hash algorithms.
msg262739 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-01 13:23
> I added a new method to the hash object named fromfile().

Usually, fromxxx() is used to create a new object. In your case, it's more about updating an existing hash object. So I would prefer the name "readfile".

IMHO you need two methods:

* hashobj.readfile(filename: str)
* hashobj.readfileobj(file) where file is an object with a read() method which returns byte strings

The implementation of the two methods can be very different. In readfile(), you know that it's a regular file which exists on the file system. So you can directly use _Py_fstat() to get st_blksize and then loop on _Py_read().

For readfileobj(), the file object doesn't need to exist on disk; fileno() can raise an exception or not exist at all.

I suggest looking at the copyfile() and copyfileobj() functions of the shutil module. For example, copyfileobj() has an optional parameter for the buffer size. You should probably use that to avoid a complex heuristic to guess the optimal buffer size.
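
A rough pure-Python sketch of how the two proposed methods could relate, written here as hypothetical module-level helpers (the actual patch would implement them in C on the hash object):

def hash_readfileobj(hashobj, fileobj, length=64 * 1024):
    # works with any object exposing read(); mirrors shutil.copyfileobj's
    # optional buffer-size parameter
    while True:
        data = fileobj.read(length)
        if not data:
            break
        hashobj.update(data)

def hash_readfile(hashobj, filename):
    # regular file on disk; a C implementation could use fstat() to pick
    # st_blksize as the read size instead of a fixed one
    with open(filename, 'rb', buffering=0) as fp:
        hash_readfileobj(hashobj, fp)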
msg262740 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2016-04-01 13:27
For readfile() it might make more sense to implement it directly in C and let OpenSSL's BIO layer handle IO internally. It's more efficient and you can release the GIL around the whole operation.
msg262796 - (view) Author: Aviv Palivoda (palaviv) * Date: 2016-04-02 14:34
> * hashobj.readfile(filename: str)
> * hashobj.readfileobj(file) where file is an object with a read() method which returns byte strings

I changed the API to the one Victor suggested.

> For readfile() it might make more sense to implement it directly in C and let OpenSSL's BIO layer handle IO internally. It's more efficient and you can release the GIL around the whole operation.

The readfile method uses the OpenSSL BIO and releases the GIL around the whole operation.

> I suggest looking at the copyfile() and copyfileobj() functions of the shutil module. For example, copyfileobj() has an optional parameter for the buffer size. You should probably use that to avoid a complex heuristic to guess the optimal buffer size.

Added an optional block_size argument to readfileobj().

> In readfile(), you know that it's a regular file which exists on the file system. So you can directly use _Py_fstat() to get st_blksize

Currently using a constant block size in readfile(). From the discussion in issue 26488 I am not sure whether this should be changed.
History
Date User Action Args
2022-04-11 14:57:42 admin set github: 61638
2016-04-02 14:34:40 palaviv set files: + 17436-2.patch
messages: + msg262796
2016-04-01 13:27:36 christian.heimes set nosy: + christian.heimes
messages: + msg262740
2016-04-01 13:23:23 vstinner set messages: + msg262739
2016-04-01 13:13:23 palaviv set files: + 17436.patch
versions: + Python 3.6, - Python 3.4
nosy: + palaviv
messages: + msg262738
keywords: + patch
2013-03-18 21:19:55 vstinner set title: pass a file object to hashlib.update -> hashlib: add a method to hash the content of a file
versions: + Python 3.4, - Python 3.5
2013-03-17 14:34:05 vstinner set messages: + msg184377
2013-03-16 17:56:08 techtonik set messages: + msg184339
2013-03-16 17:32:31 vstinner set messages: + msg184337
2013-03-16 15:12:27 techtonik set messages: + msg184321
2013-03-16 15:11:08 techtonik set messages: + msg184320
2013-03-16 14:36:30 jcea set nosy: + jcea
2013-03-16 11:11:24 vstinner set nosy: + vstinner
messages: + msg184304
2013-03-16 10:36:54 techtonik set messages: + msg184302
2013-03-16 10:35:24 techtonik set title: pass a string to hashlib.update -> pass a file object to hashlib.update
2013-03-16 10:11:12 techtonik create