This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: sha and md5 modules should use OpenSSL when possible
Type: Stage:
Components: Extension Modules Versions:
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: arigo, gregory.p.smith, jimjjewett, terry.reedy
Priority: normal Keywords: patch

Created on 2005-02-13 01:33 by gregory.p.smith, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
hashlib-008.patch gregory.p.smith, 2005-06-12 03:14 adds documentation and cleanup
hashlib-009.patch gregory.p.smith, 2005-08-15 03:28 use py2.2 iface for methods and attrs, doc updates
hashlib-010.patch gregory.p.smith, 2005-08-21 18:50 final patch - committed 2005-08-21 ~18:00 UTC
Messages (15)
msg47773 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2005-02-13 01:33
The md5 and sha (sha1) modules should use OpenSSL for
the algorithms when it is available, as its
implementations are much faster than Python's own.

Attached is an initial patch to use OpenSSL for the sha
module.  It's not ready to commit as is, but it is
set up to be a generic base for all OpenSSL hashes with
a little bit of work in the future.  Tossing this out
there for people to see how trivial it is and enjoy the
speedups.

The diff is against HEAD but it should apply to 2.4 just fine.
msg47774 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2005-02-17 06:46
hashes-openssl-002.patch replaces the sha and md5 modules
with a general hashes module that wraps all hashes that
OpenSSL supports.

Note that OpenSSL's implementations are much faster than the
previous Python versions, as it chooses versions optimized for
your particular hardware.

In case Python is compiled without OpenSSL, the hashes wrapper
falls back on the old Python sha and md5 module implementations.
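
The fallback described above can be sketched roughly as follows. This is only an illustration written against present-day CPython, not the code from the patch: `new_hash` is a hypothetical helper name, and `_hashlib` is the private OpenSSL wrapper module, which is assumed here and may be missing on some builds.

```python
import hashlib


def new_hash(name, data=b""):
    """Return a hash object, preferring the OpenSSL-backed constructor.

    Hypothetical helper for illustration; not part of the patch.
    """
    try:
        # _hashlib is CPython's private OpenSSL wrapper; it can be absent.
        import _hashlib
        return _hashlib.new(name, data)
    except (ImportError, ValueError):
        # Fall back to whatever implementation hashlib itself provides.
        return hashlib.new(name, data)


digest = new_hash("sha1", b"abc").hexdigest()
print(digest)
```

Callers get the same object interface either way, which is what lets the faster implementation be swapped in transparently.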

Side note: this may be sufficient for the Debian folks to
work around their random odd licensing issue.  Just have
Debian's Python depend on OpenSSL, use this, and remove the old
md5 module/code that wouldn't get used anyway.
msg47775 - (view) Author: Jim Jewett (jimjjewett) Date: 2005-02-18 19:21
Should the private modules (such as _sha) be placed in a 
crypto package, instead of directly in the 
parent/everything library?
msg47776 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2005-02-28 18:11
A much-updated patch (hashlib-patch-004.patch).  It
incorporates some suggestions and includes SF patch
935454's sha256/224 and sha512/384 implementations.

Still not complete, but it shows the direction it's going in (I
see a segfault partway through the test suite after running
the sha512 tests).

As for the private modules being under another package, I
see no reason to do that since there aren't very many (how
does that work for binary modules anyway?).
msg47777 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2005-03-01 09:14
hashlib-005.patch now passes its test suite and no problems
appear in valgrind.
msg47778 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2005-03-03 21:15
hashlib-006.patch adds fast constructors and a speed test.
Documentation is the next step.
msg47779 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2005-03-10 08:09
The 007 patch improves the speed of the constructor.  There
is still a potential speed issue with the
constructor/destructor to work on:

greg@spiff src $ ./python Lib/test/test_hashlib_speed.py _sha
testing speed of old _sha legacy interface
0.06 seconds [20000 creations]
0.24 seconds [20000 "" digests]
0.15 seconds 20 x 106201 bytes [huge data]
0.15 seconds 200 x 10620 bytes [large data]
0.17 seconds 2000 x 1062 bytes [medium data]
0.35 seconds 20020 x 106 bytes [small data]
1.37 seconds 106200 x 20 bytes [digest_size data]
2.75 seconds 212400 x 10 bytes [tiny data]
greg@spiff src $ ./python Lib/test/test_hashlib_speed.py sha1
testing speed of hashlib.sha1 <built-in function openssl_sha1>
0.22 seconds [20000 creations]
0.57 seconds [20000 "" digests]
0.09 seconds 20 x 106201 bytes [huge data]
0.09 seconds 200 x 10620 bytes [large data]
0.15 seconds 2000 x 1062 bytes [medium data]
0.71 seconds 20020 x 106 bytes [small data]
3.39 seconds 106200 x 20 bytes [digest_size data]
6.70 seconds 212400 x 10 bytes [tiny data]

I suspect the cause is the shared OpenSSL library call
overhead, the OpenSSL EVP abstraction interface, or both.
The speed results are very similar to the above
regardless of which digest is used (the above was on a
333 MHz Celeron running Linux).
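
A rough way to reproduce this kind of measurement on a current Python is sketched below. It is not the `test_hashlib_speed.py` script from the patch: `time_hashing` is a hypothetical helper, and it assumes each row above means "N hash objects created, each fed M bytes", which the output does not state explicitly.

```python
import hashlib
import time


def time_hashing(constructor, chunk_size, count):
    """Time `count` create-and-digest cycles over `chunk_size` bytes each."""
    data = b"x" * chunk_size
    start = time.perf_counter()
    for _ in range(count):
        # One object per chunk, so constructor overhead is included.
        constructor(data).digest()
    return time.perf_counter() - start


# Sizes and counts borrowed from three rows of the output above.
results = {}
for label, size, count in [("huge", 106201, 20),
                           ("small", 106, 20020),
                           ("tiny", 10, 212400)]:
    results[label] = time_hashing(hashlib.sha1, size, count)
    print("%.2f seconds %d x %d bytes [%s data]"
          % (results[label], count, size, label))
```

With tiny inputs the per-object setup cost dominates, which is the regime where the constructor overhead discussed here would show up.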
msg47780 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2005-03-13 01:13
I linked a _hashlib.so library statically against OpenSSL
and reran the speed test.  No change.  That means it's not
shared library overhead causing the higher startup time; it is
just an artifact of the OpenSSL EVP interface.

Next up: analyze what data sizes common heavy sha1-using
applications (BitTorrent and such) regularly hash.
msg47781 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2005-06-12 03:21
OK, this patch is ready.  Documentation has been added.
I'll bring it up on python-dev for discussion/approval with
a link to the htmlified documentation.

The speedups are great for any application hashing a lot of
data when OpenSSL is used.  It also adds sha224, sha256,
sha384 and sha512 support.
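
For reference, a small sketch of the added algorithms as they can be used through the `hashlib` module that eventually shipped. It uses the current `hashlib.new` interface rather than anything specific to this patch revision.

```python
import hashlib

# Digest size in bytes for each of the newly supported algorithms.
sizes = {}
for name in ("sha224", "sha256", "sha384", "sha512"):
    h = hashlib.new(name, b"abc")
    sizes[name] = h.digest_size
    print(name, h.digest_size, h.hexdigest()[:16] + "...")
```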
msg47782 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2005-06-12 12:18
On a side note, maybe it makes sense for a new module like this one to promote and use the modern (>=2.2) ways of defining C types.

What I have in mind is using tp_methods instead of Py_FindMethod, and generally not reverting to strcmp().  In this case, the constants like 'digest_size' would be best stored as class attributes instead, if possible.  Indeed, allowing expressions like "hashlib.md5.digest_size" conveys the idea that the result doesn't depend on a particular instance, unlike "hashlib.md5().digest_size".  (Of course class attributes are also readable from the instance, as usual.)

I can give it a try if you don't want to invest more time in this patch than you already did (for which we are grateful to you :-)
msg47783 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2005-06-12 20:35
Re Doc page: As a somewhat naive (relative to the subject)
reader, the title and first sentence implied that 'secure hash'
and 'message digest' are two separate things, whereas, judging
from the .digest() blurb, they both seem to be 16-byte hashes.
So I would prefer this equivalence and the actual meaning were
made clear at the top.  Something like "This module implements a
common interface to several secure hash or message digest
algorithms that produce 16-byte hashes."

If, as I presume, xx.hexdigest() == binascii.hexlify(xx.digest()),
then I would say so and reference binascii for the
interconversion functions one would need if one had the two
versions to compare or needed to convert after the extraction.
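
The presumed equivalence can be checked with a short sketch like the one below, written for Python 3, where `binascii.hexlify` returns bytes and therefore needs decoding before it can be compared with the `str` from `hexdigest()`.

```python
import binascii
import hashlib

h = hashlib.sha1(b"some data")

# hexdigest() is the hex encoding of the raw digest bytes.
hex_from_digest = binascii.hexlify(h.digest()).decode("ascii")
same = h.hexdigest() == hex_from_digest

# Converting back: recover the raw digest from its hex form.
raw = binascii.unhexlify(h.hexdigest())

print(same, raw == h.digest())
```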
msg47784 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2005-08-01 01:29
Per arigo's suggestion, I have a version of _hashopenssl.c in
my sandbox modified to use the more modern C type API.  The
constructor is slightly faster (~1-2%) and does seem like a
better way to do things.  I'll post it after polishing it up.
msg47785 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2005-08-15 03:28
tjreedy's and arigo's comments have been taken into
consideration.  An updated patch (009) has been attached.
It uses the Python >= 2.2 interface for defining methods and
member variables rather than the getattr function with
manual strcmp's.

I was unable to make digest_size and such class attributes
because the hashes are not classes.  The hashlib.md5
function for instance is a constructor function that returns
an appropriate internal HASH object.  The goal of those
constructors is to be as fast as possible; wrapping them
with python in order to make them actual classes would be
too slow and I did not see an obvious way to do it from C.
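
The distinction can be seen from Python with a sketch like this one, run against a current `hashlib` (sha1 is used here; the same applies to md5). The exact type name and the result of the `hasattr` check depend on the interpreter and build, so they are printed rather than assumed.

```python
import hashlib

# hashlib.sha1 is a constructor; attributes live on the object it returns.
h = hashlib.sha1()
instance_size = h.digest_size

# Whether the constructor itself exposes digest_size is build-dependent.
on_constructor = hasattr(hashlib.sha1, "digest_size")

print(type(h).__name__, instance_size, on_constructor)
```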

I believe this patch is ready to commit.  Further
improvements or refinements can be made to it in CVS.

The documentation in HTML for easy viewing has been updated at

http://electricrain.com/greg/hashlib-py25-doc/
msg47786 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2005-08-15 08:47
I see that it would indeed be messy to have 'md5' be a type
and 'digest_size' a class attribute, given that 'md5' can
come from various places depending on what is installed;
moreover, in the _hashopenssl.c file, unless I'm mistaken, all
hashes use the same Python type.  Fine by me.
msg47787 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2005-08-21 18:50
hashlib has been committed to HEAD for inclusion in Python 2.5.

I've attached a hashlib-010.patch that is the exact CVS diff
of what I committed after further testing.
History
Date User Action Args
2022-04-11 14:56:09 admin set github: 41573
2005-02-13 01:33:11 gregory.p.smith create