Issue27265

Created on **2016-06-08 08:04** by **Radosław Szalski**, last changed **2016-06-09 19:10** by **serhiy.storchaka**. This issue is now **closed**.

Messages (7) | |||
---|---|---|---|

msg267802 - (view) | Author: Radosław Szalski (Radosław Szalski) | Date: 2016-06-08 08:04 | |

Python 2.7.11 (default, May 9 2016, 19:53:39) [GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from decimal import Decimal >>> hash(Decimal('0.05')) == hash(Decimal('0.005')) True >>> hash(Decimal(0.05)) == hash(Decimal(0.005)) False Interactive example: https://repl.it/CZUJ/6 Those values are numerically different and IMO should not have equal hashes. I am aware of the quirk with hash(-1) == hash(-2) which is at play here. This only applies to Decimals created from strings as they are hashed differently than float-based ones. What happens? The following equation is True: >>> hash(Decimal('0.005')) == hash(Decimal('0.05')) What I expected to happen? The following equation to be False: >>> hash(Decimal('0.005')) == hash(Decimal('0.05')) Platform: MacBook Pro Retina, 13", early 2015, OSX 10.11.5 Tested on Python Versions: 2.7.3, 2.7.10, 2.7.11 (installed via pyenv), all exhibit the same behavior Relevant (not duplicate) issues I've found: http://bugs.python.org/issue10356 - decimal.py: hash of -1 http://bugs.python.org/issue8188 - Unified hash for numeric types. --- Transcript of the interactive example: https://repl.it/CZUJ/6: from decimal import Decimal # These two different decimals have equal hashes print(hash(Decimal('0.005')) == hash(Decimal('0.05'))) # Internally, Decimal's hash is computed like this (sign, exp + len(int), int) print(hash((0, -2+len('5'), '5')) == hash((0, -3+len('5'), '5'))) # Removing constants, this becomes print(hash(-2+len('5')) == hash(-3+len('5'))) # Which can be further simplified to: print(hash(-1) == hash(-2)) # They both return -2, because at the CPython level, -1 is reserved for errors # Whereas: print(hash(Decimal(0.005)) == hash(Decimal(0.05))) # And this is because Decimals created from floats are hashed as floats |
|||

msg267840 - (view) | Author: Mark Dickinson (mark.dickinson) * | Date: 2016-06-08 12:00 | |

There's nothing wrong with two different Decimal objects having the same hash (indeed, it's inevitable, given that there are fewer than 2**64 hash values available, and many more possible Decimal objects). It only becomes a problem if you have a largish naturally-occurring dataset whose values all end up falling into the same hash bucket, resulting in linear-time dict operations instead of constant time. I don't think that's the case here: each example of this form only has two different values with the same hash. @Radosław Szalski: is this causing problems in a real application? If not, I think it should be closed as "won't fix". Note that Python 3 is not subject to this issue: it uses a different hashing technique (as described in the issue 8188 that you already linked to). |
|||

msg267862 - (view) | Author: Mark Dickinson (mark.dickinson) * | Date: 2016-06-08 15:12 | |

For what it's worth, here are timings on my machine showing the overhead of the extra equality check when a hash collision occurs. Python 2.7.11 (default, Mar 1 2016, 18:08:21) Type "copyright", "credits" or "license" for more information. IPython 4.2.0 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. help -> Python's own help system. object? -> Details about 'object', use 'object??' for extra details. In [1]: from decimal import Decimal In [2]: set1 = set([Decimal(str(n/1000.0)) for n in range(1, 10)] + [Decimal(str(n/100.0)) for n in range(1, 10)]) In [3]: set2 = set([Decimal(str(n/1000.0)) for n in range(2, 20)]) In [4]: print len(set1), len(set2) # Both sets have the same length 18 18 In [5]: print len(set(map(hash, set1))), len(set(map(hash, set2))) # But set1 has hash collisions 9 18 In [6]: %timeit Decimal('0.005') in set1 # checking elt in the set, first match is the right one The slowest run took 5.98 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 17.4 µs per loop In [7]: %timeit Decimal('0.05') in set1 # checking elt in the set, collision resolution needed The slowest run took 5.72 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 19.6 µs per loop In [8]: %timeit Decimal('0.005') in set2 # should be similar to the first set1 result The slowest run took 5.99 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 17.3 µs per loop |
|||

msg268029 - (view) | Author: Radosław Szalski (Radosław Szalski) | Date: 2016-06-09 15:12 | |

Thanks for the reply and analysis, Mark. My motivation was that as a "clueless user", I shouldn't worry about how Decimals are created. Given two equal numbers, I would expect their behavior (e.g., result of a hash) to be the same as well. In this example, the behavior differs simply based on whether the Decimal was created from a string vs a float. However, since there are bound to be collisions, and the performance overhead is negligible (in case of a collision), and Python 3 solves this problem I agree this can be closed as "won't-fix". |
|||

msg268045 - (view) | Author: Mark Dickinson (mark.dickinson) * | Date: 2016-06-09 17:54 | |

> the behavior differs simply based on whether the Decimal was created from a string vs a float That's not quite right: a Decimal object keeps no knowledge of how it was created. The behaviour differs depending on whether the value of the Decimal happens to be exactly representable as a Python float or not. That's necessary to ensure the invariant `x == y` implies `hash(x) == hash(y)` continues to hold across types (though Python 3 has a better way of doing this). So for example `Decimal('0.375')` was created from a string, but will hash equal to the exactly equal float `0.375`: noether:float-proofs mdickinson$ python2 Python 2.7.11 (default, May 1 2016, 08:20:00) [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from decimal import Decimal >>> hash(Decimal('0.375')), hash(Decimal(0.375)) (1610579968, 1610579968) >>> hash(0.375) 1610579968 |
|||

msg268049 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2016-06-09 18:39 | |

Note that Decimal(0.05) != Decimal('0.05'). >>> Decimal(0.05) Decimal('0.05000000000000000277555756156289135105907917022705078125') >>> hash(Decimal(0.05)) 966367654 >>> hash(Decimal('0.05000000000000000277555756156289135105907917022705078125')) 966367654 >>> hash(0.05) 966367654 |
|||

msg268053 - (view) | Author: Radosław Szalski (Radosław Szalski) | Date: 2016-06-09 19:09 | |

Thanks for the comments, you are both correct. I think that the issue is resolved now, so I'm closing this a won't fix. |

History | |||
---|---|---|---|

Date | User | Action | Args |

2016-06-09 19:10:42 | serhiy.storchaka | set | stage: resolved |

2016-06-09 19:09:39 | Radosław Szalski | set | status: open -> closed messages: + msg268053 |

2016-06-09 18:39:11 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages: + msg268049 |

2016-06-09 17:54:33 | mark.dickinson | set | messages: + msg268045 |

2016-06-09 15:12:48 | Radosław Szalski | set | status: pending -> open messages: + msg268029 |

2016-06-08 15:12:31 | mark.dickinson | set | status: open -> pending |

2016-06-08 15:12:20 | mark.dickinson | set | status: pending -> open messages: + msg267862 |

2016-06-08 12:56:12 | mark.dickinson | set | status: open -> pending resolution: wont fix |

2016-06-08 12:00:01 | mark.dickinson | set | messages: + msg267840 |

2016-06-08 08:04:21 | Radosław Szalski | create |