classification
Title: Wrong MD5 calculation on really long strings and the Hashlib
Type: security Stage: resolved
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Alonso.Vidales, neologix, vstinner
Priority: normal Keywords:

Created on 2013-05-02 08:08 by Alonso.Vidales, last changed 2013-05-02 09:20 by Alonso.Vidales. This issue is now closed.

Files
File name | Uploaded | Description
Screen Shot 2013-05-02 at 3.13.48 AM.png | Alonso.Vidales, 2013-05-02 08:08 | Test results
Messages (3)
msg188255 - Author: Alonso Vidales (Alonso.Vidales) Date: 2013-05-02 08:08
While taking part in a contest, I found a bug when working with a string of 6227020800 characters, all of them the character 'a'.
When I execute:
   hashlib.md5(string).hexdigest()
On Python, it takes a couple of seconds (it should need more time to calculate this MD5, so I think it crashes without raising any exception), and returns the hash:
   8adbd18519be193db41dd5341a260963
When I execute the same code using PyPy, the hash is:
   38fe0c01bfa0eb9d153b034f8408a8b7
To find out which one is the correct hash, I created a file with 6227020800 characters on my system and ran the "md5" of OpenSSL on it, which I think we can trust, obtaining the same result as PyPy:
    unknown7cd1c3ecbb9b:Downloads socialpoint$ wc -c md5_test
     6227020800 md5_test
    unknown7cd1c3ecbb9b:Downloads socialpoint$ md5 md5_test
    MD5 (md5_test) = 38fe0c01bfa0eb9d153b034f8408a8b7
So Python is doing something wrong.
Please find attached a screenshot that shows the results of the tests. If you can't reproduce it or need more information, please let me know.

My Python version:
Python 2.7.1 (r271:86832, Aug  5 2011, 03:30:24) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin

Pypy version:
Python 2.7.2 (341e1e3821ff, Jun 07 2012, 15:42:54)
[PyPy 1.9.0 with GCC 4.2.1] on darwin
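
To cross-check the digest without building a 6 GB string in memory, the hash can be fed in chunks. The following is a minimal sketch, not code from this report: `md5_repeated` and its `chunk_size` parameter are hypothetical names chosen for illustration, and it relies only on `hashlib.md5()` and its `update()` method.

```python
import hashlib


def md5_repeated(byte, total, chunk_size=1 << 20):
    """Return the MD5 hex digest of `byte` repeated `total` times.

    The data is fed to the hash in chunks, so the full string never has
    to exist in memory. Function and parameter names are hypothetical.
    """
    digest = hashlib.md5()
    chunk = byte * chunk_size
    full_chunks, remainder = divmod(total, chunk_size)
    for _ in range(full_chunks):
        digest.update(chunk)
    # Feed whatever is left after the last full chunk (may be empty).
    digest.update(byte * remainder)
    return digest.hexdigest()


# Small sanity check: chunked hashing must agree with one-shot hashing.
print(md5_repeated(b"a", 2500, chunk_size=1000) == hashlib.md5(b"a" * 2500).hexdigest())
```

Calling `md5_repeated(b"a", 6227020800)` hashes the input size from this report. On a correct MD5 build, the result should match what `md5sum` prints for a file with the same contents.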
msg188258 - Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-05-02 08:59
I'm getting the same hash as CPython with md5sum and openssl, on Linux:

$ wc -c data
6227020800 data
$ md5sum data
8adbd18519be193db41dd5341a260963  data
$ openssl md5 data
MD5(data)= 8adbd18519be193db41dd5341a260963

So it's correct, and your system's openssl version is borked. As for why PyPy returns the same result as your system's md5, I've no clue: maybe it's linked against your system libraries.
msg188259 - Author: Alonso Vidales (Alonso.Vidales) Date: 2013-05-02 09:20
This seems to be a problem with the system libs (I use MacOS 10.7.5). I just created a file with 6227020800 'a' chars in a Linux environment, and the result is:
  8adbd18519be193db41dd5341a260963
I'll try to confirm this.

root@tras2:/var/tmp# pypy create_input.py 
root@tras2:/var/tmp# wc -c md5_test 
6227020800 md5_test
root@tras2:/var/tmp# md5sum md5_test
8adbd18519be193db41dd5341a260963  md5_test
root@tras2:/var/tmp# cat create_input.py 
f = open('md5_test', 'w')
for count in xrange(0, 622702080):
	f.write('aaaaaaaaaa')
f.close()
root@tras2:/var/tmp# cat /proc/version 
Linux version 3.2.13-xxxx-std-ipv6-64 (root@kernel-64.ovh.net) (gcc version 4.3.2 (Debian 4.3.2-1.1) ) #1 SMP Wed Mar 28 11:20:17 UTC 2012
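
For reference, a variant of the file-creation script that writes larger chunks and runs on both Python 2 and Python 3 could look like the sketch below. `write_repeated`, `chunk_size`, and `demo_path` are hypothetical names introduced here, and the demonstration writes only 2500 bytes so that it finishes instantly.

```python
import os
import tempfile


def write_repeated(path, byte, total, chunk_size=1 << 20):
    """Write `byte` repeated `total` times to `path`, one chunk at a time."""
    chunk = byte * chunk_size
    full_chunks, remainder = divmod(total, chunk_size)
    with open(path, "wb") as handle:
        for _ in range(full_chunks):
            handle.write(chunk)
        # Write the tail that does not fill a whole chunk (may be empty).
        handle.write(byte * remainder)


# Small demonstration; pass total=6227020800 to recreate the file from this thread.
demo_path = os.path.join(tempfile.mkdtemp(), "md5_test_small")
write_repeated(demo_path, b"a", 2500, chunk_size=1000)
print(os.path.getsize(demo_path))
```

Opening the file in binary mode keeps the byte count exact on every platform, which matters when the result is checked with `wc -c`.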
History
Date | User | Action | Args
2013-05-02 09:20:48 | Alonso.Vidales | set | messages: + msg188259
2013-05-02 08:59:59 | neologix | set | status: open -> closed; nosy: + neologix; messages: + msg188258; resolution: not a bug; stage: resolved
2013-05-02 08:26:49 | vstinner | set | nosy: + vstinner
2013-05-02 08:08:09 | Alonso.Vidales | create |