This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Feature request: Add decdigest() to hashlib
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.10
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: 2d4d, christian.heimes, gregory.p.smith
Priority: normal Keywords:

Created on 2021-01-16 21:33 by 2d4d, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (5)
msg385150 - (view) Author: Arnim Rupp (2d4d) Date: 2021-01-16 21:33
Problem: hashlib only offers digest() and hexdigest() but the fastest way to work with hashes is as integer.

The first thing loki does after getting the hashes is to convert them to int:
md5, sha1, sha256 = generateHashes(fileData)
                        md5_num=int(md5, 16)
                        sha1_num=int(sha1, 16)
                        sha256_num=int(sha256, 16)
https://github.com/Neo23x0/Loki/blob/master/loki.py

All the ~50000 hashes to compare are also converted to int after reading them from a file. The comparison is about twice as fast compared to hexdigest in strings because it uses just half the memory. 

(The use case here is to compare these 50,000 hashes to the hashes of all the 200,000 files on a system that gets scanned for malicious files.)

Solution: Add decdigest() to hashlib which returns the int version of the hash. This has 2 advantages: 
1. It saves the time for converting the hash to hex and back
2. Having decdigest() in the documentation inspires more programmers to work with hashes as int opposed to slow strings (where it's performance relevant.)

Should be just few lines of code for each algorithm, I could do the PR.

static PyObject *
_sha3_shake_128_hexdigest(SHA3object *self, PyObject *arg)
{
    PyObject *return_value = NULL;
    unsigned long length;

    if (!_PyLong_UnsignedLong_Converter(arg, &length)) {
        goto exit;
    }
    return_value = _sha3_shake_128_hexdigest_impl(self, length);

https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Modules/_sha3/clinic/sha3module.c.h
msg385159 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-01-17 11:31
Do you have any benchmarks that back up your claim that integers are faster than using digest or hexdigests? Python's str and bytes types are highly optimized.

Hash digests don't fit into native integers, because they are larger than uint64_t and therefore have to be converted into arbitrary size integers (aka bigints). Arbitrary size integers have an overhead. For example it's slower to convert bytes to an integer than to hex string. Comparison of long its takes about as much time as comparing bytes.

By the way int(h.hexdigest(), 16) is a slow and inefficient way to convert a hash digest into an integer. int.from_bytes(h.digest(), endian) is much faster.

$ ./python -m timeit -s "from hashlib import sha256; s = sha256()" "s.digest()"
500000 loops, best of 5: 450 nsec per loop
$ ./python -m timeit -s "from hashlib import sha256; s = sha256()" "s.hexdigest()"
500000 loops, best of 5: 615 nsec per loop
$ ./python -m timeit -s "from hashlib import sha256; s = sha256()" "int.from_bytes(s.digest(), 'big')"
500000 loops, best of 5: 809 nsec per loop
$ ./python -m timeit -s "from hashlib import sha256; s = sha256()" "int(s.hexdigest(), 16)"
200000 loops, best of 5: 1.03 usec per loop
msg385160 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-01-17 11:35
Is there any particular reason you are using bisect search with sorted list of integers? Why don't you use a simple approach with a dict of digest bytes? bisect search is O(log(n)), dict lookup is O(1) and therefore scales much better.
msg385165 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2021-01-17 19:01
Agreed, using a dict or set hash table lookup is more appropriate for such an algorithm.

Also agreed: comparing python integers (30-bit digit bignums internally) cannot be faster than comparing a binary bytes object.
msg385166 - (view) Author: Arnim Rupp (2d4d) Date: 2021-01-17 19:03
oh, you're absolutely right about digest(), sorry, mixed the representation with the data. closing this, thanks.
History
Date User Action Args
2022-04-11 14:59:40adminsetgithub: 87108
2021-01-17 19:03:502d4dsetstatus: open -> closed
resolution: not a bug
messages: + msg385166

stage: resolved
2021-01-17 19:01:18gregory.p.smithsetmessages: + msg385165
2021-01-17 11:35:17christian.heimessetmessages: + msg385160
2021-01-17 11:31:15christian.heimessetnosy: + gregory.p.smith, christian.heimes
messages: + msg385159
2021-01-16 21:33:202d4dcreate