classification
Title: __hash__ documentation recommends naive XOR to combine but this is suboptimal
Type: performance Stage:
Components: Documentation Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Kevin.Norris, belopolsky, christian.heimes, docs@python, eric.araujo, inada.naoki, python-dev, vstinner
Priority: normal Keywords: patch

Created on 2016-10-07 04:37 by Kevin.Norris, last changed 2016-12-19 12:17 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
hash_doc.patch vstinner, 2016-10-18 14:49 review
Messages (9)
msg278229 - (view) Author: Kevin Norris (Kevin.Norris) Date: 2016-10-07 04:37
The documentation for __hash__ contains this text:

"The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects."

The recommendation of "using exclusive or" is likely to result in programmers naively doing this:

def __hash__(self):
    return hash(self.thing1) ^ hash(self.thing2) ^ hash(self.thing3)

In the event that (say) self.thing1 and self.thing2 have almost or exactly the same hash (with "almost" referring to bitwise differences rather than integral distance), this wipes out most or all of the entropy from both values and greatly increases the likelihood of hash collisions.  Indeed, Python's own tuple type does not do this (while it does use XOR, it also does some other math to ensure the bits are as mixed up as is practical).[1]

Because the correct algorithm is both nontrivial to implement and already exists in the tuple type's __hash__, I propose that the documentation be updated to recommend something like the following:

def __hash__(self):
    return hash((self.thing1, self.thing2, self.thing3))

One possible wording:

"The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple: [code example]"

[1]: https://hg.python.org/cpython/file/fca5c4a63251/Objects/tupleobject.c#l348
msg278254 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2016-10-07 16:38
This makes sense.  Note that this is the way hashes are implemented for the datetime objects: <https://hg.python.org/cpython/file/v3.6.0b1/Lib/datetime.py#l635>.
msg278886 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-18 14:49
hash(tuple of attributes) is what I'm using in all my projects.

Here is a patch for the doc.
msg278887 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2016-10-18 14:50
ACK!
msg278912 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-18 17:00
Christian Heimes:
>  ACK!

Does it mean that my patch LGTY?
msg283610 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2016-12-19 12:05
LGTM
msg283611 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-12-19 12:10
New changeset cb802a78ceea by Victor Stinner in branch '3.5':
doc: Suggest to hash(tuple of attr) rather than XOR
https://hg.python.org/cpython/rev/cb802a78ceea
msg283614 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-12-19 12:16
New changeset fac2362f248c by Victor Stinner in branch '2.7':
doc: Suggest to hash(tuple of attr) rather than XOR
https://hg.python.org/cpython/rev/fac2362f248c
msg283615 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-19 12:17
I updated the doc of Python 2.7, 3.5, 3.6 and default (3.7).

Thanks for the suggestion Kevin, thanks for the review Naoki.
History
Date User Action Args
2016-12-19 12:17:11vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg283615

versions: - Python 3.3, Python 3.4
2016-12-19 12:16:21python-devsetmessages: + msg283614
2016-12-19 12:10:39python-devsetnosy: + python-dev
messages: + msg283611
2016-12-19 12:05:45inada.naokisetnosy: + inada.naoki
messages: + msg283610
2016-10-18 17:00:12vstinnersetmessages: + msg278912
2016-10-18 14:50:27christian.heimessetnosy: + christian.heimes
messages: + msg278887
2016-10-18 14:49:02vstinnersetfiles: + hash_doc.patch

nosy: + vstinner
messages: + msg278886

keywords: + patch
2016-10-07 16:38:44belopolskysetnosy: + belopolsky
messages: + msg278254
2016-10-07 16:29:00eric.araujosetnosy: + eric.araujo
2016-10-07 04:37:23Kevin.Norriscreate