This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

Title: Please make it possible to make the output of hash() equal between 32 and 64 bit architectures
Type:
Stage:
Components:
Versions: Python 3.5
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: ethan.furman, georg.brandl, josch, rhettinger
Priority: normal
Keywords:

Created on 2014-10-13 06:09 by josch, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (6)
msg229219 - Author: josch (josch) Date: 2014-10-13 06:09
I recently realized that the output of the following is different between 32 bit and 64 bit architectures:

PYTHONHASHSEED=0 python3 -c 'print(hash("a"))'

In my case, I'm running test cases that call a Python module which creates graphs of several hundred megabytes, among other things. The fastest way to check that the output matches what I expect is to run the md5sum or sha256sum shell tools on it and compare the results with the expected values. Unfortunately, some libraries I use rely on the order of items in Python dictionaries for their output. Yes, they should not do that, but they also don't care and thus don't fix the problem.

My initial solution to this was to use PYTHONHASHSEED=0, which helped, but I have now found out that it only produces the same hash within the set of 32 bit architectures and within the set of 64 bit architectures, respectively. See the line above, which behaves differently depending on the integer size of the architecture.

So what I'd like CPython to have is yet another workaround like PYTHONHASHSEED which allows me to temporarily influence the inner workings of the hash() function such that it behaves the same on 32 bit and 64 bit architectures. Maybe something like PYTHONHASH32BIT or similar?

If I understand the CPython hash function correctly, then this environment variable would just bitmask the result of the function with 0xFFFFFFFF or cast it to int32_t to achieve the same output across architectures.
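To make the two variants in the previous paragraph concrete, here is a minimal sketch of the arithmetic only. The helper names `to_uint32` and `to_int32` are hypothetical, and nothing here claims that truncating a hash from a 64 bit build yields the value a 32 bit build would actually compute.

```python
def to_uint32(value):
    """Keep only the low 32 bits of an integer (bitmask with 0xFFFFFFFF)."""
    return value & 0xFFFFFFFF


def to_int32(value):
    """Reinterpret the low 32 bits as signed, like a C cast to int32_t."""
    low = value & 0xFFFFFFFF
    # Values with the top bit set represent negative numbers in int32_t.
    return low - 0x100000000 if low >= 0x80000000 else low
```

For example, `to_uint32(hash("a"))` and `to_int32(hash("a"))` both fit in 32 bits on any build, but whether they agree across architectures depends on how the underlying hash is computed, which this sketch does not address.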

Would this be possible?

My only alternatives seem to be either maintaining patched versions of all the modules I use that wrongly rely on dictionary ordering, or going to great lengths to parse the (more or less) random output they produce into a sorted intermediate format. The latter seems like a bad idea: the files are several hundred megabytes in size, so it would take very long and add complexity compared to simply running md5sum or sha256sum on them to check whether my test cases succeed.
msg229222 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2014-10-13 06:57
While I can feel your pain regarding the use case you describe, I don't think this has enough general value to add to CPython.  It is not really related to PYTHONHASHSEED, since we never made guarantees about hash values being stable across platforms and Python versions.  PYTHONHASHSEED was introduced to address backwards compatibility for the rare cases where stable hash values are required within a platform/version combination.

Without knowing anything about your libraries, would it not be possible to create a stable representation within the test case for comparison purposes, without having to write the unstable result to a file and then parsing it?  That should be acceptable, given that creating and manipulating those graphs will probably also take significant time in the first place.
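A stable representation of this kind could look roughly like the sketch below. It assumes, purely for illustration, that the graph is available as a mapping from node names to collections of neighbour names; the function name `stable_digest` and that data layout are hypothetical and not taken from the libraries in question.

```python
import hashlib


def stable_digest(graph):
    """Return a SHA-256 hex digest that ignores dict/set iteration order.

    `graph` is assumed to map a node name (str) to an iterable of
    neighbour names (str).
    """
    digest = hashlib.sha256()
    # Sorting removes any dependence on hash-based iteration order.
    for node in sorted(graph):
        digest.update(node.encode("utf-8") + b"\n")
        for neighbour in sorted(graph[node]):
            digest.update(b"  " + neighbour.encode("utf-8") + b"\n")
    return digest.hexdigest()
```

Because the data is fed to the digest incrementally, nothing has to be written to a file and parsed back; the cost is the sorting itself.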
msg229223 - Author: josch (josch) Date: 2014-10-13 07:15
Thank you for your quick reply.

Yes, as I wrote above, there are ways around it by creating a stable in-memory representation and comparing that to a stable in-memory representation of the expected output. Since both inputs are several hundred megabytes in size, this would be CPU-intensive but doable. I would just have liked to avoid treating this output in a special way, because I also compare other files and it is easiest to just md5sum all of the files in one fell swoop.

I started using PYTHONHASHSEED to gain stable output for a certain platform/version combination. When I uploaded my package to Debian and it was built on 13 different architectures, I noticed the discrepancy that arises when the same version but different platforms are involved.

From my perspective it would be nice to just be able to set PYTHONHASH32BIT (or whatever) and call it a day. But of course it is your choice whether you would allow such a "hack" or not.

Would your decision be more favorable if you received a patch implementing this feature?
msg229226 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2014-10-13 07:29
> Would your decision be more favorable if you received a patch implementing this feature?

I'll keep this on "pending" for other devs to weigh in with opinions.

In general, we are not keen on keeping text representations stable, as they do not form part of the API.  This is true for exception messages most of all, but also the representations of other types change occasionally.  Doctests and other test methods that rely on exact output, such as yours, have to adapt to that.

The patch wouldn't be difficult to write, but the issue is more that it isn't really generally useful (as evidenced by the fact that you are the first to request it), and it won't save you a lot of work in any case if you want to support existing versions of Python (2.7, 3.x) as well: the new feature could only go into 3.5.
msg229263 - Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2014-10-13 19:17
Like Georg I am sympathetic to the problem, but this is not the correct solution.

You might post a question on python-list to see if a usable, not-too-painful solution can be found.
msg229275 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-10-14 01:01
> some libraries I use rely on the order of items in Python dictionaries for their output

Your choices are:

* create a custom hash function using __hash__
* or sort the output from within Python
* or sort the output externally, prior to diffing.
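The first option in the list above might be sketched as follows. This is only an illustration under stated assumptions: `StableKey` is a hypothetical class name, CRC-32 is an arbitrary choice of deterministic function, and equal hash values alone are not a documented guarantee of identical dict or set ordering across builds.

```python
import zlib


class StableKey(str):
    """A str subclass whose hash does not depend on PYTHONHASHSEED.

    Note: StableKey("a") == "a" is still True while their hashes differ,
    so do not mix StableKey and plain str keys in the same container.
    """

    def __hash__(self):
        # Mask to 30 bits so the value fits a signed 32 bit hash unchanged.
        return zlib.crc32(self.encode("utf-8")) & 0x3FFFFFFF
```

Keys would then be wrapped at creation time, for example `{StableKey("a"): 1}`, which only helps if the code that builds the containers can be changed.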

> From my perspective it would be nice to just be able 

The odds of this being accepted are almost zero.
We're moving in the opposite direction: randomizing
the hash from run to run.
History
Date                 User          Action  Args
2022-04-11 14:58:09  admin         set     github: 66811
2014-10-14 01:01:08  rhettinger    set     nosy: + rhettinger; messages: + msg229275
2014-10-14 00:04:11  pitrou        set     status: open -> closed
2014-10-13 19:17:07  ethan.furman  set     status: pending -> open; nosy: + ethan.furman; messages: + msg229263
2014-10-13 07:29:38  georg.brandl  set     status: open -> pending; resolution: rejected; messages: + msg229226
2014-10-13 07:15:23  josch         set     messages: + msg229223
2014-10-13 06:57:26  georg.brandl  set     nosy: + georg.brandl; messages: + msg229222
2014-10-13 06:09:23  josch         create