Issue 27265: Hash of different, specific Decimals created from str is the same

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/71452

classification

Title:	Hash of different, specific Decimals created from str is the same
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 2.7

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	Radosław Szalski, mark.dickinson, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2016-06-08 08:04 by Radosław Szalski, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (7)
msg267802 - (view)	Author: Radosław Szalski (Radosław Szalski)	Date: 2016-06-08 08:04
Python 2.7.11 (default, May 9 2016, 19:53:39) [GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from decimal import Decimal >>> hash(Decimal('0.05')) == hash(Decimal('0.005')) True >>> hash(Decimal(0.05)) == hash(Decimal(0.005)) False Interactive example: https://repl.it/CZUJ/6 Those values are numerically different and IMO should not have equal hashes. I am aware of the quirk with hash(-1) == hash(-2) which is at play here. This only applies to Decimals created from strings as they are hashed differently than float-based ones. What happens? The following equation is True: >>> hash(Decimal('0.005')) == hash(Decimal('0.05')) What I expected to happen? The following equation to be False: >>> hash(Decimal('0.005')) == hash(Decimal('0.05')) Platform: MacBook Pro Retina, 13", early 2015, OSX 10.11.5 Tested on Python Versions: 2.7.3, 2.7.10, 2.7.11 (installed via pyenv), all exhibit the same behavior Relevant (not duplicate) issues I've found: http://bugs.python.org/issue10356 - decimal.py: hash of -1 http://bugs.python.org/issue8188 - Unified hash for numeric types. --- Transcript of the interactive example: https://repl.it/CZUJ/6: from decimal import Decimal # These two different decimals have equal hashes print(hash(Decimal('0.005')) == hash(Decimal('0.05'))) # Internally, Decimal's hash is computed like this (sign, exp + len(int), int) print(hash((0, -2+len('5'), '5')) == hash((0, -3+len('5'), '5'))) # Removing constants, this becomes print(hash(-2+len('5')) == hash(-3+len('5'))) # Which can be further simplified to: print(hash(-1) == hash(-2)) # They both return -2, because at the CPython level, -1 is reserved for errors # Whereas: print(hash(Decimal(0.005)) == hash(Decimal(0.05))) # And this is because Decimals created from floats are hashed as floats
msg267840 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2016-06-08 12:00
There's nothing wrong with two different Decimal objects having the same hash (indeed, it's inevitable, given that there are fewer than 2**64 hash values available, and many more possible Decimal objects). It only becomes a problem if you have a largish naturally-occurring dataset whose values all end up falling into the same hash bucket, resulting in linear-time dict operations instead of constant time. I don't think that's the case here: each example of this form only has two different values with the same hash. @Radosław Szalski: is this causing problems in a real application? If not, I think it should be closed as "won't fix". Note that Python 3 is not subject to this issue: it uses a different hashing technique (as described in the issue 8188 that you already linked to).
msg267862 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2016-06-08 15:12
For what it's worth, here are timings on my machine showing the overhead of the extra equality check when a hash collision occurs. Python 2.7.11 (default, Mar 1 2016, 18:08:21) Type "copyright", "credits" or "license" for more information. IPython 4.2.0 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. help -> Python's own help system. object? -> Details about 'object', use 'object??' for extra details. In [1]: from decimal import Decimal In [2]: set1 = set([Decimal(str(n/1000.0)) for n in range(1, 10)] + [Decimal(str(n/100.0)) for n in range(1, 10)]) In [3]: set2 = set([Decimal(str(n/1000.0)) for n in range(2, 20)]) In [4]: print len(set1), len(set2) # Both sets have the same length 18 18 In [5]: print len(set(map(hash, set1))), len(set(map(hash, set2))) # But set1 has hash collisions 9 18 In [6]: %timeit Decimal('0.005') in set1 # checking elt in the set, first match is the right one The slowest run took 5.98 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 17.4 µs per loop In [7]: %timeit Decimal('0.05') in set1 # checking elt in the set, collision resolution needed The slowest run took 5.72 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 19.6 µs per loop In [8]: %timeit Decimal('0.005') in set2 # should be similar to the first set1 result The slowest run took 5.99 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 17.3 µs per loop
msg268029 - (view)	Author: Radosław Szalski (Radosław Szalski)	Date: 2016-06-09 15:12
Thanks for the reply and analysis, Mark. My motivation was that as a "clueless user", I shouldn't worry about how Decimals are created. Given two equal numbers, I would expect their behavior (e.g., result of a hash) to be the same as well. In this example, the behavior differs simply based on whether the Decimal was created from a string vs a float. However, since there are bound to be collisions, and the performance overhead is negligible (in case of a collision), and Python 3 solves this problem I agree this can be closed as "won't-fix".
msg268045 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2016-06-09 17:54
> the behavior differs simply based on whether the Decimal was created from a string vs a float That's not quite right: a Decimal object keeps no knowledge of how it was created. The behaviour differs depending on whether the value of the Decimal happens to be exactly representable as a Python float or not. That's necessary to ensure the invariant `x == y` implies `hash(x) == hash(y)` continues to hold across types (though Python 3 has a better way of doing this). So for example `Decimal('0.375')` was created from a string, but will hash equal to the exactly equal float `0.375`: noether:float-proofs mdickinson$ python2 Python 2.7.11 (default, May 1 2016, 08:20:00) [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from decimal import Decimal >>> hash(Decimal('0.375')), hash(Decimal(0.375)) (1610579968, 1610579968) >>> hash(0.375) 1610579968
msg268049 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-06-09 18:39
Note that Decimal(0.05) != Decimal('0.05'). >>> Decimal(0.05) Decimal('0.05000000000000000277555756156289135105907917022705078125') >>> hash(Decimal(0.05)) 966367654 >>> hash(Decimal('0.05000000000000000277555756156289135105907917022705078125')) 966367654 >>> hash(0.05) 966367654
msg268053 - (view)	Author: Radosław Szalski (Radosław Szalski)	Date: 2016-06-09 19:09
Thanks for the comments, you are both correct. I think that the issue is resolved now, so I'm closing this a won't fix.

History
Date	User	Action	Args
2022-04-11 14:58:32	admin	set	github: 71452
2016-06-09 19:10:42	serhiy.storchaka	set	stage: resolved
2016-06-09 19:09:39	Radosław Szalski	set	status: open -> closed messages: + msg268053
2016-06-09 18:39:11	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg268049
2016-06-09 17:54:33	mark.dickinson	set	messages: + msg268045
2016-06-09 15:12:48	Radosław Szalski	set	status: pending -> open messages: + msg268029
2016-06-08 15:12:31	mark.dickinson	set	status: open -> pending
2016-06-08 15:12:20	mark.dickinson	set	status: pending -> open messages: + msg267862
2016-06-08 12:56:12	mark.dickinson	set	status: open -> pending resolution: wont fix
2016-06-08 12:00:01	mark.dickinson	set	messages: + msg267840
2016-06-08 08:04:21	Radosław Szalski	create