classification
Title: datetime hash is deterministic in some cases
Type: behavior Stage: resolved
Components: Documentation Versions: Python 3.9, Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: belopolsky, benjamin.peterson, christian.heimes, dmalcolm, docs@python, epicfaace, lemburg, mark.dickinson, miss-islington, rhettinger, serhiy.storchaka, tim.peters, vstinner
Priority: normal Keywords: patch

Created on 2017-02-11 15:16 by arigo, last changed 2019-08-24 10:29 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 1937 closed arigo, 2017-06-04 06:30
PR 15264 closed epicfaace, 2019-08-14 00:42
PR 15269 closed serhiy.storchaka, 2019-08-14 08:18
PR 15454 merged miss-islington, 2019-08-24 09:49
Messages (19)
msg287601 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2017-02-11 15:16
The documentation on the hash randomization says that date, time and datetime have a hash based on strings, that is therefore nondeterministic in several runs of Python.  I may either be missing a caveat, or the actual implementation does not follow its promise in case a timezone is attached to the datetime or time object:

~/svn/python/3.7-debug/python -c "import datetime;print(hash(d                      atetime.datetime(2016,10,10,0,0,0,0,datetime.timezone(datetime.timedelta(0, 36000)))))"
(this gives -6021186165085109055 all the time)

~/svn/python/3.7-debug/python -c "import datetime;print(hash(datetime.time(0,0,0,0, datetime.timezone(datetime.timedelta(0, 36000)))))"
(this gives -3850122659820237607 all the time)
msg287602 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-02-11 15:25
Only the hash of str and bytes are randomized/ The types date, datetime and time are not subject to hash randomization. Same for int, float, bool and None.
msg287607 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2017-02-11 15:51
That's not what the docs say.  E.g.: https://docs.python.org/3/reference/datamodel.html#object.__hash__ says

    By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

Morever, this command really prints changing numbers:

~/svn/python/3.7-debug/python -c "import datetime;print(hash(d                      atetime.datetime(2016,10,10,0,0,0,0)))"
msg287609 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-02-11 16:06
I only checked Python 2.7. For Python 3.x it's a bit more complicated:

timedelta: PyObject_Hash(), always the same hash value
date: _Py_HashBytes(), always a randomized hash value
time: _Py_HashBytes() for offset = None, PyObject_Hash() for offset != 0
datetime: _Py_HashBytes() for offset = None, PyObject_Hash() for offset != 0
timezone: PyObject_Hash() (inherited from object)

I don't know why the datetime module doesn't use hash randomization for datetime and time objects with an offset. MAL is the master of (date)time. He might know.
msg295165 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-06-05 04:58
Rather than changing the documentation, I would prefer to have this fixed for date/time/datetiem unless MAL has a reason not to make the change.

For timezone, I don't think we really care.  

For timedelta, it is reasonable to always be the same hash.  It isn't conceptually different from the float given by s.total_seconds() or a tuple of (days, seconds, microseconds).
msg295167 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-05 05:23
It looks to me that this difference is not intentional. This is just a consequence of the fact that __hash__ functions for some of these objects use the hash of the pickle state which is a bytes object.
msg349636 - (view) Author: Ashwin Ramaswami (epicfaace) * Date: 2019-08-14 03:30
I've added a PR which should fix this. Do you think the documentation should also be updated to change "By default, the :meth:`__hash__` values of str, bytes and datetime objects are "salted" with an unpredictable random value." to "By default, the :meth:`__hash__` values of str, bytes, datetime.date, datetime.time and datetime.datetime objects are "salted" with an unpredictable random value."?

Technically, there are other objects in datetime such as datetime.relativedelta whose hash values are _not_ non-deterministic.
msg349653 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-08-14 06:45
Other option is to remove a note about datetime hash. It is an implementation detail.

There are other objects with nondeterministic hash, for example tuples containing strings, but we should not document this explicitly or make the hash of all tuples nondeterministic.
msg349731 - (view) Author: Ashwin Ramaswami (epicfaace) * Date: 2019-08-14 18:07
Why is it ok for certain hashes (such as tuples) to be not non-deterministic, while other hashes (such as datetime) need to be non-deterministic?
msg349750 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2019-08-14 21:04
PEP 456 explains why hash of str and bytes must be randomized.

I don't know any reason why hash of datetime objects must be randomized. They can be deterministic like floats and ints.
msg349780 - (view) Author: Ashwin Ramaswami (epicfaace) * Date: 2019-08-14 23:49
Randomizing the hash of datetime objects was first proposed in https://bugs.python.org/issue13703#msg151796.

For the same reasons as str and bytes are non-deterministically hashed in in PEP 456, shouldn't numerics, datetime objects, and tuples be non-deterministically hashed as well? This is for the reason that they can all be used as dictionary keys (additionally, hash(n) begins to repeat when n is a large enough number) -- so it seems like they are also susceptible to the hash collision DoS attacks.
msg349801 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-08-15 10:50
> shouldn't numerics, datetime objects, and tuples be non-deterministically hashed as well? [...]

Making the numeric hash non-predictable while maintaining its current properties would be difficult.

But fortunately, I don't think it's necessary. IIUC, the original DOS attack involved carefully-crafted collections of keywords and values being passed to a website backend, with that backend then putting those keywords and values into a Python dictionary. I'd expect that there are *way* more places where a dict is being constructed with string keys in this way than with numeric keys. In fact, I think it's reasonable to assume that there are no websites vulnerable to a DOS via *numeric* hash collisions until we see evidence otherwise.

FWIW, I'd expect the same to be true for datetime objects; I'm not sure why they were originally included. IANASE, but it seems to me that covering Unicode strings and bytestrings should be enough in practice.
msg349828 - (view) Author: Ashwin Ramaswami (epicfaace) * Date: 2019-08-15 21:19
> Making the numeric hash non-predictable while maintaining its current properties would be difficult.

Why so?

> In fact, I think it's reasonable to assume that there are no websites vulnerable to a DOS via *numeric* hash collisions until we see evidence otherwise. I'd expect that there are *way* more places where a dict is being constructed with string keys in this way than with numeric keys.

That's true, but why do we restrict ourselves to websites? This is how I see it: As a Python developer, it seems like my program is immune to hash collision DoS if I use strings/bytes as dictionary keys, but *not* if my keys, say, are tuples of strings. Why not make the hash non-predictable for all builtin types by default?
msg349853 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-08-16 08:54
> Why so?

Python's hash needs to obey the invariant that `hash(x) == hash(y)` for any two hashable objects with `x == y`. That makes life particularly hard for numeric types, because there are a number of *different* numeric types where equality can hold across those types. This includes not just built-in types, but third party types as well (think of NumPy, gmpy2, SymPy, and other libraries providing numbers that need to compare equal to Python numbers with the same value).

So for example, `hash(1.5)`,  `hash(Decimal("1.5"))`, `hash(Fraction(3, 2))`, `hash(1.5 + 0j)`, `hash(numpy.float32(1.5))`, `hash(bigfloat.BigFloat(1.5, precision=200))` must _all_ be equal to one another within a single running Python process.

Moreover, hash computation needs to be efficient for common types like floats and integers, while also not being impossibly slow for other types. (Early versions of Decimal's hash did a check to see whether the Decimal instance was an exact integer, and if so, converted that Decimal instance to an integer before taking its hash. But doing that with `Decimal(1e999999)` doesn't go well.)

It would definitely be *possible* to:

- compute a hash in a cross-type-compatible way
- do some sort of uniform post-processing of that hash, incorporating information from a per-process random salt

The operations described by Melissa O'Neill in her PCG paper give ideas for some ways to do such post-processing: regard the hash and the salt as together forming a 128-bit integer, and then collapse that 128-integer down to a 64-bit integer using one of the PCG post-processing methods. Note that as far as I know there's been no validation of those methods from a cryptographic (rather than a statistical) perspective.

However, it would be significant work, be disruptive not just to CPython, but to 3rd party packages and to other Python implementations, would slow down common hashing operations, and would increase the amount and the complexity of code that has to be maintained into the future.

So there's no shortage of reasons *not* to change the numeric hash. What I think we're lacking is a single reason *to* change it. Can you give a plausible example of a situation where the predictability of the numeric hash can lead to possible security issues?

See also the recent issue #37807.

> but *not* if my keys, say, are tuples of strings

Bad example. :-) The hash of a tuple is based on the hash of its contents. So if those contents are strings, the tuple benefits from the string hash randomization.

mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
-824966383135019564
mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
-5971473524922642515
mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
5384650403450490974
msg349865 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-08-16 14:20
I'm with Mark:  leave numeric hashes alone.  There's no reason to change them, and in addition to what Mark wrote it's a positively Good Thing that `hash(i) == i` for all sufficiently small ints.  Not only is that efficient to compute, it guarantees there are no collisions _at all_ in the common case of a dict indexed by a contiguous range of integers.

The _purpose_ of `hash()` in Python isn't to create an illusion of randomness, it's to support efficient dicts and sets.  Mucking with string hashes was a pragmatic hack to alleviate concerns about DOS attacks specific to string-indexed dicts.  A better technical solution to that would have been to introduce a different flavor of dict with guaranteed good worst-case behaviors, but, pragmatically, a great many users would never realize that's what they really wanted, and it wouldn't have helped pre-existing code at all.

But there's no reason to spread that hack beyond the use cases that needed it, and much reason not to.
msg350285 - (view) Author: Ashwin Ramaswami (epicfaace) * Date: 2019-08-23 13:54
Makes sense, thanks for the explanation. The risk is that if there is code that, say, converts a POST dictionary to a dictionary with numeric keys, that code could be exploited. Creating a non-deterministic hash doesn't necessarily preclude hash(x) = x for a small enough x either. 

Given that other libraries (NumPy, etc.) rely on the numeric hash staying the way it is, it makes sense to keep it as it is. Since when did something that seems at first glance to be an implementation detail become more like a backwards-incompatible API, though? (For example, the implementation of the numeric hash was changed without any backwards-compatibility issues in https://bugs.python.org/issue14621). Might there be a better way to clarify this distinction for other features in Python?

I think the way forward for this patch is to keep the datetime hash as it is, and remove "datetime" in the parts of documentation that enumerate which data types have non-deterministic hashes.
msg350297 - (view) Author: Ashwin Ramaswami (epicfaace) * Date: 2019-08-23 14:18
Oh, that PR is already there in PR 15269, great!
msg350357 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-08-24 09:49
New changeset e9c90aa43144b0be1e4e393e8cb549573437a5da by Serhiy Storchaka in branch 'master':
bpo-29535: Remove promize about hash randomization of datetime objects. (GH-15269)
https://github.com/python/cpython/commit/e9c90aa43144b0be1e4e393e8cb549573437a5da
msg350361 - (view) Author: miss-islington (miss-islington) Date: 2019-08-24 10:19
New changeset 076d0b9f5def35aeb0f8e8aadf658dc35aace81d by Miss Islington (bot) in branch '3.8':
bpo-29535: Remove promize about hash randomization of datetime objects. (GH-15269)
https://github.com/python/cpython/commit/076d0b9f5def35aeb0f8e8aadf658dc35aace81d
History
Date User Action Args
2019-08-24 10:29:35serhiy.storchakasetstatus: open -> closed
assignee: docs@python
type: security -> behavior
components: + Documentation
versions: - Python 3.5, Python 3.6, Python 3.7
nosy: + docs@python

resolution: fixed
stage: patch review -> resolved
2019-08-24 10:19:53miss-islingtonsetnosy: + miss-islington
messages: + msg350361
2019-08-24 09:49:43miss-islingtonsetpull_requests: + pull_request15148
2019-08-24 09:49:30serhiy.storchakasetmessages: + msg350357
2019-08-23 14:18:36epicfaacesetmessages: + msg350297
2019-08-23 13:54:35epicfaacesetmessages: + msg350285
2019-08-21 11:10:34vstinnersetnosy: + vstinner
2019-08-16 14:20:11tim.peterssetnosy: + tim.peters
messages: + msg349865
2019-08-16 08:54:42mark.dickinsonsetmessages: + msg349853
2019-08-15 21:19:36epicfaacesetmessages: + msg349828
2019-08-15 10:50:14mark.dickinsonsetnosy: + mark.dickinson
messages: + msg349801
2019-08-14 23:49:31epicfaacesetnosy: + dmalcolm
messages: + msg349780
2019-08-14 21:04:49christian.heimessetmessages: + msg349750
2019-08-14 18:07:20epicfaacesetmessages: + msg349731
2019-08-14 08:18:40serhiy.storchakasetnosy: + benjamin.peterson
2019-08-14 08:18:11serhiy.storchakasetpull_requests: + pull_request14991
2019-08-14 06:45:26serhiy.storchakasetmessages: + msg349653
2019-08-14 03:30:11epicfaacesetmessages: + msg349636
2019-08-14 00:42:30epicfaacesetkeywords: + patch
stage: patch review
pull_requests: + pull_request14983
2019-08-13 20:42:48arigosetnosy: - arigo
2019-08-13 19:23:07epicfaacesetnosy: + epicfaace

versions: + Python 3.8, Python 3.9
2017-06-05 05:23:09serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg295167
2017-06-05 04:58:10rhettingersetnosy: + rhettinger
messages: + msg295165
2017-06-04 06:30:00arigosetpull_requests: + pull_request2016
2017-02-11 16:08:41serhiy.storchakasetnosy: + belopolsky

versions: + Python 3.6
2017-02-11 16:06:01christian.heimessetnosy: + lemburg
messages: + msg287609
2017-02-11 15:51:02arigosetmessages: + msg287607
2017-02-11 15:25:54christian.heimessetnosy: + christian.heimes
messages: + msg287602
2017-02-11 15:16:53arigocreate