This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Add class-functions to hash many small objects with hashlib
Type: enhancement
Stage:
Components: Library (Lib)
Versions: Python 3.2, Python 3.3
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: ebfe, gregory.p.smith, pitrou, rhettinger
Priority: normal
Keywords:

Created on 2010-11-03 22:44 by ebfe, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (6)
msg120351 - (view) Author: Lukas Lueg (ebfe) Date: 2010-11-03 22:44
The objects provided by hashlib mainly serve the purpose of computing hashes over strings of arbitrary size. The user creates a new object (e.g. hashlib.sha1()), calls .update() with chunks of data and then finally uses .digest() or .hexdigest() to get the hash. For convenience, these steps can also be collapsed into a single expression (e.g. hashlib.sha1('foobar').hexdigest()).
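For reference, both patterns look like this (a minimal sketch using the existing hashlib API):

import hashlib

# Incremental: feed data in chunks via update(), then read the digest.
h = hashlib.sha1()
h.update(b'foo')
h.update(b'bar')
print(h.hexdigest())  # identical to hashing b'foobar' at once

# One-shot convenience form.
print(hashlib.sha1(b'foobar').hexdigest())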
While the above approach basically covers all use-cases for hash functions, it is still inefficient when computing hashes of many small strings (e.g. due to interpreter overhead) and leaves no room for performance improvements.

There are many cases where we need the hashes of numerous (small) objects, most or all of which are available in memory at the same time.

I therefore propose to extend the classes provided by hashlib with an additional function that takes an iterable object, computes the hash over the string representation of each member and returns the results. Because of the aim of this interface, the function is a member of the class (not the instance) and therefore has no state bound to an instance. Memory requirements are to be anticipated and met by the programmer.

For example:

foo = ['my_database_key1', 'my_database_key2']
hashlib.sha1.compute(foo) 
>> ('\x00\x00', '\xff\xff')
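A pure-Python sketch of the proposed semantics (the compute helper below is hypothetical; the actual proposal would implement this loop in C):

import hashlib

def compute(hash_cls, iterable):
    # Hypothetical equivalent of the proposed hashlib.sha1.compute():
    # hash each member independently and return all digests at once.
    return tuple(hash_cls(item).digest() for item in iterable)

foo = [b'my_database_key1', b'my_database_key2']
digests = compute(hashlib.sha1, foo)

The point of the proposal is that a C implementation could process several inputs in one pass (e.g. multiple SHA1 states in SSE2 registers) instead of one at a time.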


I consider this interface to hashlib particularly useful, as we can take advantage of vector-based implementations that compute multiple hashes in one pass (e.g. through SSE2). GCC has a vector extension that provides a *somewhat* standard way to write code that can be compiled to SSE2 or similar machine code. Examples of vector-based implementations of SHA1 and MD5 can be found at https://code.google.com/p/pyrit/issues/detail?id=207


Contingency plan: we fall back to code that iterates over OpenSSL's EVP functions if the compiler is not GCC or SSE2 is not available. The same approach can be used to cover hashlib objects for which we don't have an optimized implementation.
msg120441 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-11-04 21:07
This sounds dangerously like a micro-optimization to me, and I'm not sure an additional API is ok for that.
msg120450 - (view) Author: Lukas Lueg (ebfe) Date: 2010-11-04 21:40
Thanks for your comment; it is a very valid point to consider. However, as a vector-based implementation is roughly three to four times faster than what the current API can provide by design (reduced overhead and GIL relaxation not included), I have to disagree.

I'm willing to make a proposal (read: patch) if you and the other overlords have enough confidence in this API change and it has a chance of being accepted.
msg120455 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-11-04 21:55
> Thanks for your comment; it is a very valid point to consider.
> However, as a vector-based implementation is roughly three to four
> times faster than what the current API can provide by design (reduced
> overhead and GIL relaxation not included), I have to disagree.

I fear that this kind of optimization is a moving target. If some CPU
architecture gains hardware support for cryptographic hashes, OpenSSL
will support it and be much faster than Python's parallel version.

> I'm willing to make a proposal (read: patch) if you and the other
> overlords have enough confidence in this API change and it has a
> chance of being accepted.

If the only advantage is some speedup on parallel hashing of
many-small-strings, I don't think a new API is warranted for such a
specialized use case.
msg120457 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-11-04 22:20
I concur with Overlord Antoine ;-)

The API of "hashfunc *files" looks like something that should be done at the command line in bash. For the most part, I think users are better off using normal Python loops so they can do something with the computed hashes as the files are read.
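For illustration, the plain-loop approach Raymond refers to might look like this (file names are hypothetical; only standard hashlib and file APIs are used):

import hashlib

filenames = ['a.bin', 'b.bin']  # hypothetical inputs
for name in filenames:
    h = hashlib.sha1()
    with open(name, 'rb') as f:
        # Read in fixed-size chunks so memory use stays bounded.
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    print(name, h.hexdigest())  # act on each hash as soon as it is ready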
msg120564 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-11-06 00:27
This is an application-specific optimization that I'd like to see as its own library or as a feature added to NSS or OpenSSL. Regardless, if you want this for Python, write an extension module and put it on PyPI before aiming for the stdlib. I understand the optimization potential, but it is a bit obscure.
History
Date                 User             Action  Args
2022-04-11 14:57:08  admin            set     github: 54511
2010-11-06 00:27:23  gregory.p.smith  set     status: open -> closed; resolution: rejected; messages: + msg120564
2010-11-04 22:20:14  rhettinger       set     nosy: + rhettinger; messages: + msg120457
2010-11-04 21:55:26  pitrou           set     messages: + msg120455
2010-11-04 21:40:23  ebfe             set     messages: + msg120450
2010-11-04 21:07:59  pitrou           set     nosy: + gregory.p.smith, pitrou; messages: + msg120441
2010-11-03 22:44:16  ebfe             create