
Author: mark.dickinson  Date: 2019-08-16 08:54:42  Nosy: lemburg, rhettinger, belopolsky, christian.heimes, benjamin.peterson, dmalcolm, serhiy.storchaka, epicfaace  Message-id: <1565945682.51.0.189873766497.issue29535@roundup.psfhosted.org>
Content
> Why so?

Python's hash needs to obey the invariant that `hash(x) == hash(y)` for any two hashable objects with `x == y`. That makes life particularly hard for numeric types, because there are a number of *different* numeric types where equality can hold across those types. This includes not just built-in types, but third party types as well (think of NumPy, gmpy2, SymPy, and other libraries providing numbers that need to compare equal to Python numbers with the same value).

So for example, `hash(1.5)`, `hash(Decimal("1.5"))`, `hash(Fraction(3, 2))`, `hash(1.5 + 0j)`, `hash(numpy.float32(1.5))`, `hash(bigfloat.BigFloat(1.5, precision=200))` must *all* be equal to one another within a single running Python process.
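The invariant is easy to check directly; since numpy and bigfloat are third-party packages, this quick sanity check sticks to the standard library:

```python
from decimal import Decimal
from fractions import Fraction

# All of these objects compare equal across types...
values = [1.5, Decimal("1.5"), Fraction(3, 2), 1.5 + 0j]
assert all(v == values[0] for v in values)

# ...so they must all share a single hash value.
assert len({hash(v) for v in values}) == 1
```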

Moreover, hash computation needs to be efficient for common types like floats and integers, while also not being impossibly slow for other types. (Early versions of Decimal's hash did a check to see whether the Decimal instance was an exact integer, and if so, converted that Decimal instance to an integer before taking its hash. But doing that with `Decimal("1e999999")` doesn't go well.)
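CPython squares this circle by hashing every number as if it were a reduced fraction, working modulo the prime `sys.hash_info.modulus`, so no conversion to a huge integer is ever needed. The scheme is documented in the library reference; here's a minimal sketch for finite rationals (the helper name is mine, and sign/infinity/nan handling is simplified):

```python
import sys
from fractions import Fraction

P = sys.hash_info.modulus  # the prime 2**61 - 1 on 64-bit builds

def rational_hash(m, n):
    """Hash of the rational m/n per the documented numeric-hash scheme.

    Assumes n > 0 and n not divisible by P.
    """
    # Modular inverse of n via Fermat's little theorem: n**(P-2) == n**-1 (mod P)
    h = abs(m) * pow(n, P - 2, P) % P
    if m < 0:
        h = -h
    return -2 if h == -1 else h  # -1 is reserved as an error value in CPython

# 1.5 == 3/2, so all routes agree, cheaply, regardless of the number's type.
assert rational_hash(3, 2) == hash(1.5) == hash(Fraction(3, 2))
```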

It would definitely be *possible* to:

- compute a hash in a cross-type-compatible way
- do some sort of uniform post-processing of that hash, incorporating information from a per-process random salt

The operations described by Melissa O'Neill in her PCG paper give ideas for some ways to do such post-processing: regard the hash and the salt as together forming a 128-bit integer, and then collapse that 128-bit integer down to a 64-bit integer using one of the PCG post-processing methods. Note that as far as I know there's been no validation of those methods from a cryptographic (rather than a statistical) perspective.
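To make the idea concrete, here is a purely hypothetical sketch (this is *not* what CPython does, and the mixing constants and function names are mine): concatenate a per-process 64-bit salt with the 64-bit numeric hash, then collapse the 128-bit value with a PCG-XSL-RR-style output permutation. Because the salt is applied uniformly after the cross-type hash, equal numbers would still hash equal within one process.

```python
import os

SALT = int.from_bytes(os.urandom(8), "little")  # per-process random salt

def salted_hash(h):
    """Hypothetical post-processing: collapse (salt, hash) to 64 bits."""
    # Treat salt and hash together as one 128-bit state.
    state = ((SALT << 64) | (h & (2**64 - 1))) & (2**128 - 1)
    # XSL: xor the high and low 64-bit halves.
    xored = (state >> 64) ^ (state & (2**64 - 1))
    # RR: rotate right by the top 6 bits of the state.
    rot = state >> 122
    return ((xored >> rot) | (xored << (64 - rot))) & (2**64 - 1)
```

Since the post-processing is a bijection of the original hash for a fixed salt, equal inputs stay equal and distinct inputs stay distinct within a process, while the mapping changes from run to run.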

However, it would be significant work: it would be disruptive not just to CPython but also to third-party packages and other Python implementations, would slow down common hashing operations, and would increase the amount and complexity of code that has to be maintained into the future.

So there's no shortage of reasons *not* to change the numeric hash. What I think we're lacking is a single reason *to* change it. Can you give a plausible example of a situation where the predictability of the numeric hash can lead to possible security issues?

> but *not* if my keys, say, are tuples of strings

Bad example. :-) The hash of a tuple is based on the hash of its contents. So if those contents are strings, the tuple benefits from the string hash randomization.

```
mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
-824966383135019564
mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
-5971473524922642515
mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
5384650403450490974
```
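By contrast, numeric hashes are exactly what is *not* salted today, which is the point of this issue; the hash of `1.5` comes out the same on every run, regardless of `PYTHONHASHSEED`:

```shell
python3 -c "print(hash(1.5))"
PYTHONHASHSEED=12345 python3 -c "print(hash(1.5))"
# both commands print the same value, unlike the string-tuple example above
```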