classification
Title: Hash of different, specific Decimals created from str is the same
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: Radosław Szalski, mark.dickinson, serhiy.storchaka
Priority: normal Keywords:

Created on 2016-06-08 08:04 by Radosław Szalski, last changed 2016-06-09 19:10 by serhiy.storchaka. This issue is now closed.

Messages (7)
msg267802 - (view) Author: Radosław Szalski (Radosław Szalski) Date: 2016-06-08 08:04
Python 2.7.11 (default, May  9 2016, 19:53:39)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from decimal import Decimal
>>> hash(Decimal('0.05')) == hash(Decimal('0.005'))
True
>>> hash(Decimal(0.05)) == hash(Decimal(0.005))
False

Interactive example: https://repl.it/CZUJ/6

Those values are numerically different and IMO should not have equal hashes. I am aware of the quirk with hash(-1) == hash(-2) which is at play here.
This only applies to Decimals created from strings as they are hashed differently than float-based ones.


What happens? The following equation is True:
>>> hash(Decimal('0.005')) == hash(Decimal('0.05'))

What I expected to happen? The following equation to be False:
>>> hash(Decimal('0.005')) == hash(Decimal('0.05'))


Platform: MacBook Pro Retina, 13", early 2015, OSX 10.11.5
Tested on Python Versions: 2.7.3, 2.7.10, 2.7.11 (installed via pyenv), all exhibit the same behavior


Relevant (not duplicate) issues I've found: 

http://bugs.python.org/issue10356 - decimal.py: hash of -1
http://bugs.python.org/issue8188 - Unified hash for numeric types.



---


Transcript of the interactive example: https://repl.it/CZUJ/6:

from decimal import Decimal

# These two different decimals have equal hashes
print(hash(Decimal('0.005')) == hash(Decimal('0.05')))

# Internally, Decimal's hash is computed like this (sign, exp + len(int), int)
print(hash((0, -2+len('5'), '5')) == hash((0, -3+len('5'), '5')))

# Removing constants, this becomes
print(hash(-2+len('5')) == hash(-3+len('5')))

# Which can be further simplified to:
print(hash(-1) == hash(-2))

# They both return -2, because at the CPython level, -1 is reserved for errors

# Whereas:
print(hash(Decimal(0.005)) == hash(Decimal(0.05)))

# And this is because Decimals created from floats are hashed as floats
msg267840 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-06-08 12:00
There's nothing wrong with two different Decimal objects having the same hash (indeed, it's inevitable, given that there are fewer than 2**64 hash values available, and many more possible Decimal objects). It only becomes a problem if you have a largish naturally-occurring dataset whose values all end up falling into the same hash bucket, resulting in linear-time dict operations instead of constant time.

I don't think that's the case here: each example of this form only has two different values with the same hash.

@Radosław Szalski: is this causing problems in a real application? If not, I think it should be closed as "won't fix".

Note that Python 3 is not subject to this issue: it uses a different hashing technique (as described in the issue 8188 that you already linked to).
msg267862 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-06-08 15:12
For what it's worth, here are timings on my machine showing the overhead of the extra equality check when a hash collision occurs.

Python 2.7.11 (default, Mar  1 2016, 18:08:21) 
Type "copyright", "credits" or "license" for more information.

IPython 4.2.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from decimal import Decimal

In [2]: set1 = set([Decimal(str(n/1000.0)) for n in range(1, 10)] + [Decimal(str(n/100.0)) for n in range(1, 10)])

In [3]: set2 = set([Decimal(str(n/1000.0)) for n in range(2, 20)])

In [4]: print len(set1), len(set2)  # Both sets have the same length
18 18

In [5]: print len(set(map(hash, set1))), len(set(map(hash, set2)))  # But set1 has hash collisions
9 18

In [6]: %timeit Decimal('0.005') in set1  # checking elt in the set, first match is the right one
The slowest run took 5.98 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 17.4 µs per loop

In [7]: %timeit Decimal('0.05') in set1  # checking elt in the set, collision resolution needed
The slowest run took 5.72 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 19.6 µs per loop

In [8]: %timeit Decimal('0.005') in set2  # should be similar to the first set1 result
The slowest run took 5.99 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 17.3 µs per loop
msg268029 - (view) Author: Radosław Szalski (Radosław Szalski) Date: 2016-06-09 15:12
Thanks for the reply and analysis, Mark.

My motivation was that as a "clueless user", I shouldn't worry about how Decimals are created. Given two equal numbers, I would expect their behavior (e.g., result of a hash) to be the same as well. In this example, the behavior differs simply based on whether the Decimal was created from a string vs a float.

However, since there are bound to be collisions, and the performance overhead is negligible (in case of a collision), and Python 3 solves this problem I agree this can be closed as "won't-fix".
msg268045 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-06-09 17:54
> the behavior differs simply based on whether the Decimal was created from a string vs a float

That's not quite right: a Decimal object keeps no knowledge of how it was created. The behaviour differs depending on whether the value of the Decimal happens to be exactly representable as a Python float or not. That's necessary to ensure the invariant `x == y` implies `hash(x) == hash(y)` continues to hold across types (though Python 3 has a better way of doing this).

So for example `Decimal('0.375')` was created from a string, but will hash equal to the exactly equal float `0.375`:

noether:float-proofs mdickinson$ python2
Python 2.7.11 (default, May  1 2016, 08:20:00) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from decimal import Decimal
>>> hash(Decimal('0.375')), hash(Decimal(0.375))
(1610579968, 1610579968)
>>> hash(0.375)
1610579968
msg268049 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-06-09 18:39
Note that Decimal(0.05) != Decimal('0.05').

>>> Decimal(0.05)
Decimal('0.05000000000000000277555756156289135105907917022705078125')
>>> hash(Decimal(0.05))
966367654
>>> hash(Decimal('0.05000000000000000277555756156289135105907917022705078125'))
966367654
>>> hash(0.05)
966367654
msg268053 - (view) Author: Radosław Szalski (Radosław Szalski) Date: 2016-06-09 19:09
Thanks for the comments, you are both correct. I think that the issue is resolved now, so I'm closing this a won't fix.
History
Date User Action Args
2016-06-09 19:10:42serhiy.storchakasetstage: resolved
2016-06-09 19:09:39Radosław Szalskisetstatus: open -> closed

messages: + msg268053
2016-06-09 18:39:11serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg268049
2016-06-09 17:54:33mark.dickinsonsetmessages: + msg268045
2016-06-09 15:12:48Radosław Szalskisetstatus: pending -> open

messages: + msg268029
2016-06-08 15:12:31mark.dickinsonsetstatus: open -> pending
2016-06-08 15:12:20mark.dickinsonsetstatus: pending -> open

messages: + msg267862
2016-06-08 12:56:12mark.dickinsonsetstatus: open -> pending
resolution: wont fix
2016-06-08 12:00:01mark.dickinsonsetmessages: + msg267840
2016-06-08 08:04:21Radosław Szalskicreate