classification
Title: hashlib.md5 / json inconsistency
Type: behavior
Stage: resolved
Components: Interpreter Core
Versions: Python 3.3
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: hynek, poppe1219, r.david.murray
Priority: normal
Keywords:

Created on 2012-10-09 19:11 by poppe1219, last changed 2012-10-10 10:13 by poppe1219. This issue is now closed.

Messages (4)
msg172506 - (view) Author: Robin Åsén (poppe1219) Date: 2012-10-09 19:11
I am getting inconsistent behavior when computing an md5 hexdigest of a JSON structure that has been converted to a string.
Am I doing something wrong here?

    import json
    import hashlib

    data = '''{"key1":"value1","key2":"value2"}'''

    print(hashlib.md5(data.encode()).hexdigest())
    jsonData = json.loads(data)
    print(hashlib.md5(str(jsonData).encode()).hexdigest())
    print(hashlib.md5(str(jsonData).encode()).hexdigest())

When I run this code, everything seems fine at first glance. However, when I run it again I get different md5 checksums.
The first md5 checksum, computed on the data string, is the same every time.
The last two md5 checksums never contradict each other within a single run, but from one run to the next I often get different values.
Here are some outputs I'm getting:

ff45cc3835165307ef414c23ca2c6f67
423b2b4d92c0947e3d99d207c7c06175
423b2b4d92c0947e3d99d207c7c06175

ff45cc3835165307ef414c23ca2c6f67
101d66cd2878eacf47c618cea6862125
101d66cd2878eacf47c618cea6862125

ff45cc3835165307ef414c23ca2c6f67
423b2b4d92c0947e3d99d207c7c06175
423b2b4d92c0947e3d99d207c7c06175

ff45cc3835165307ef414c23ca2c6f67
101d66cd2878eacf47c618cea6862125
101d66cd2878eacf47c618cea6862125


(If it makes any difference, I'm running on Windows XP SP3)
msg172509 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-10-09 19:39
The order in which elements are produced when iterating a dictionary is not fixed.  In Python 3.3 it is, by default, intentionally perturbed by a random hash seed chosen at interpreter startup.
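
To illustrate the point about ordering, here is a minimal sketch that hashes a dict's items in sorted key order, so the result does not depend on iteration order. The helper name `order_independent_text` is hypothetical and not part of any library; the sketch also assumes all keys are mutually comparable (here, all strings).

```python
import hashlib


def order_independent_text(mapping):
    """Render a dict's items in sorted key order so the text is stable."""
    return repr(sorted(mapping.items()))


a = {"key1": "value1", "key2": "value2"}
b = {"key2": "value2", "key1": "value1"}

# The two dicts compare equal, although str() may list their keys
# in a different order.
print(a == b)

digest_a = hashlib.md5(order_independent_text(a).encode()).hexdigest()
digest_b = hashlib.md5(order_independent_text(b).encode()).hexdigest()

# Sorting the items first makes both digests identical.
print(digest_a == digest_b)
```

The text hashed here is still a Python repr, not JSON, so it is only useful for comparing dicts within Python.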
msg172559 - (view) Author: Hynek Schlawack (hynek) * (Python committer) Date: 2012-10-10 09:58
Actually, that’s not the point here, the code has a deeper flaw.

You’re computing hashlib.md5() on `data.encode()` and `str(jsonData).encode()`. Did you look at what they actually contain?

>>> data.encode()
b'{"key1":"value1","key2":"value2"}'
>>> str(jsonData).encode()
b"{'key1': 'value1', 'key2': 'value2'}"

`str(jsonData)` doesn’t return JSON because it’s a simple dict():

>>> type(jsonData)
<class 'dict'>

If you wanted to have JSON again, you’d have to use `json.dumps()`:

>>> json.dumps(jsonData)
'{"key1": "value1", "key2": "value2"}'

HOWEVER: This string _also_ differs from yours because of the additional whitespace, i.e. the checksum would differ again.

Additionally, as David pointed out, you can’t rely on the order of the dict. json.dumps() could just as well return `'{"key2": "value2", "key1": "value1"}'`.
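
One way to address both the whitespace and the ordering problem is to serialize the parsed data in a fixed form before hashing. The sketch below assumes that sorted keys plus compact separators are an acceptable canonical form; `canonical_md5` is a hypothetical helper name, not a standard library function.

```python
import hashlib
import json


def canonical_md5(obj):
    """Hash obj via JSON text with sorted keys and no extra whitespace."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(text.encode("utf-8")).hexdigest()


data = '{"key1":"value1","key2":"value2"}'
parsed = json.loads(data)

# The default separators put a space after each ',' and ':'.
print(json.dumps(parsed))

# The canonical form reproduces the original compact string here,
# so both checksums agree.
print(canonical_md5(parsed) == hashlib.md5(data.encode("utf-8")).hexdigest())
```

The checksums agree in this case only because the input string was already compact and had its keys in sorted order. In general, it is safer to compute the canonical form on both sides rather than hashing the raw input text.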
msg172560 - (view) Author: Robin Åsén (poppe1219) Date: 2012-10-10 10:13
Yes, you are quite right.
Somewhere in the back of my head I had a feeling I should understand what was happening, hence my comment "Am I doing something wrong here?".
I just couldn't see it. 

Thank you.
History
Date | User | Action | Args
2012-10-10 10:13:26 | poppe1219 | set | messages: + msg172560
2012-10-10 09:58:43 | hynek | set | nosy: + hynek; messages: + msg172559
2012-10-09 19:39:30 | r.david.murray | set | status: open -> closed; nosy: + r.david.murray; messages: + msg172509; resolution: not a bug; stage: resolved
2012-10-09 19:11:23 | poppe1219 | create |