Message 229219 - Python tracker

Author	josch
Recipients	josch
Date	2014-10-13.06:09:22
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1413180563.52.0.548204621913.issue22621@psf.upfronthosting.co.za>
In-reply-to

Content

I recently realized that the output of the following command differs between 32-bit and 64-bit architectures:

PYTHONHASHSEED=0 python3 -c 'print(hash("a"))'

In my case, I'm running test cases that call a Python module which creates graphs several hundred megabytes in size, among other things. The fastest way to check that the output matches what I expect is to run the md5sum or sha256sum shell tools on it and compare the result with the expected values. Unfortunately, some libraries I use rely on the order of items in Python dictionaries for their output. Yes, they should not do that, but their maintainers don't care and thus don't fix the problem.
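For illustration, the checksum comparison described above could also be done from Python rather than with the shell tools. This is only a minimal sketch: the helper names `sha256_of_file` and `output_matches` and the chunk size are assumptions made for this example, not part of any test suite mentioned here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Reading in chunks keeps memory use flat even for files that are
    several hundred megabytes big.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def output_matches(path, expected_hex):
    """Hypothetical helper: compare a file's digest with an expected value."""
    return sha256_of_file(path) == expected_hex.lower()
```

The digest should be the same as the one printed by sha256sum for the same file, so expected values recorded with the shell tool can be reused.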

My initial solution was to set PYTHONHASHSEED=0, which helped, but I have now found out that this only produces the same hash within the set of 32-bit architectures and within the set of 64-bit architectures, respectively. See the command above, which behaves differently depending on the integer size of the architecture.
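To see which of the two groups a given interpreter falls into, the hash width can be inspected at runtime. The sketch below uses `sys.hash_info`, which exists in Python 3; the function name `describe_hash_settings` is made up for this example, and the `algorithm` field is only available from Python 3.4 on.

```python
import sys


def describe_hash_settings():
    """Collect the interpreter properties that affect hash() results."""
    return {
        # Width of hash values in bits (typically 32 or 64).
        "hash_bits": sys.hash_info.width,
        # Name of the str/bytes hash algorithm (Python 3.4+).
        "algorithm": sys.hash_info.algorithm,
        # True on builds where Py_ssize_t is wider than 32 bits.
        "is_64bit_build": sys.maxsize > 2**32,
    }


if __name__ == "__main__":
    for key, value in describe_hash_settings().items():
        print(key, value)
```

Recording these values next to the expected checksums makes it easier to tell whether a mismatch comes from the architecture or from a real regression.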

So what I'd like CPython to have is yet another workaround like PYTHONHASHSEED that lets me temporarily influence the inner workings of the hash() function so that it behaves the same on 32-bit and 64-bit architectures. Maybe something like PYTHONHASH32BIT or similar?

If I understand the CPython hash function correctly, this environment variable would just bitmask the result of the function with 0xFFFFFFFF, or cast it to int32_t, to achieve the same output across architectures.
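The two operations mentioned above are not quite the same thing, which a small Python sketch can show. The function names are hypothetical, and this only illustrates the arithmetic: it is an assumption of the proposal, not something established here, that truncating a 64-bit hash would reproduce the value a 32-bit build computes.

```python
def mask_to_32_bits(value):
    """Keep the low 32 bits; the result is unsigned, in [0, 2**32)."""
    return value & 0xFFFFFFFF


def cast_to_int32(value):
    """Emulate a C cast to int32_t; the result is signed, in [-2**31, 2**31)."""
    low = value & 0xFFFFFFFF
    # Values with the top bit set wrap around to negative numbers.
    return low - 0x100000000 if low >= 0x80000000 else low


if __name__ == "__main__":
    h = hash("a")
    print(mask_to_32_bits(h), cast_to_int32(h))
```

Both keep the same low 32 bits, but masking yields a non-negative number while the cast may yield a negative one, so a hypothetical PYTHONHASH32BIT would have to pick one of the two.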

Would this be possible?

My only alternatives seem to be either to maintain patched versions of all the modules I use that wrongly rely on dictionary ordering, or to go to great lengths to parse the (more or less) random output they produce into a sorted intermediary format. The latter seems like a bad idea: the files are several hundred megabytes big, so this would take very long and add complexity compared to simply running md5sum or sha256sum on them to check whether my test cases succeed.
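For comparison, the sorted intermediary approach could look roughly like the sketch below. It rests on assumptions the description above does not confirm: that the output is line-oriented and that the order of lines is the only thing that varies. The name `sorted_lines_digest` is made up for this example.

```python
import hashlib


def sorted_lines_digest(lines):
    """Hash an iterable of byte lines in sorted order.

    Sorting needs all lines in memory at once, which is exactly the
    extra cost that makes this unattractive for very large files.
    """
    digest = hashlib.sha256()
    for line in sorted(lines):
        # Normalize the line ending so a missing final newline does not matter.
        digest.update(line.rstrip(b"\n") + b"\n")
    return digest.hexdigest()


if __name__ == "__main__":
    first = [b"b -> c\n", b"a -> b\n"]
    second = [b"a -> b\n", b"b -> c\n"]
    print(sorted_lines_digest(first) == sorted_lines_digest(second))
```

Real graph output is usually more structured than independent lines, so an actual canonicalization step would likely be more involved than this.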

History
Date	User	Action	Args
2014-10-13 06:09:23	josch	set	recipients: + josch
2014-10-13 06:09:23	josch	set	messageid: <1413180563.52.0.548204621913.issue22621@psf.upfronthosting.co.za>
2014-10-13 06:09:23	josch	link	issue22621 messages
2014-10-13 06:09:22	josch	create