Issue 29535: datetime hash is deterministic in some cases

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/73721

classification

Title:	datetime hash is deterministic in some cases
Type:	behavior	Stage:	resolved
Components:	Documentation	Versions:	Python 3.9, Python 3.8

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	belopolsky, benjamin.peterson, christian.heimes, dmalcolm, docs@python, epicfaace, lemburg, mark.dickinson, miss-islington, rhettinger, serhiy.storchaka, tim.peters, vstinner
Priority:	normal	Keywords:	patch

Created on 2017-02-11 15:16 by arigo, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 1937	closed	arigo, 2017-06-04 06:30
PR 15264	closed	epicfaace, 2019-08-14 00:42
PR 15269	closed	serhiy.storchaka, 2019-08-14 08:18
PR 15454	merged	miss-islington, 2019-08-24 09:49

Messages (19)
msg287601 - (view)	Author: Armin Rigo (arigo) *	Date: 2017-02-11 15:16
The documentation on the hash randomization says that date, time and datetime have a hash based on strings, that is therefore nondeterministic in several runs of Python. I may either be missing a caveat, or the actual implementation does not follow its promise in case a timezone is attached to the datetime or time object: ~/svn/python/3.7-debug/python -c "import datetime;print(hash(d atetime.datetime(2016,10,10,0,0,0,0,datetime.timezone(datetime.timedelta(0, 36000)))))" (this gives -6021186165085109055 all the time) ~/svn/python/3.7-debug/python -c "import datetime;print(hash(datetime.time(0,0,0,0, datetime.timezone(datetime.timedelta(0, 36000)))))" (this gives -3850122659820237607 all the time)
msg287602 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2017-02-11 15:25
Only the hash of str and bytes are randomized/ The types date, datetime and time are not subject to hash randomization. Same for int, float, bool and None.
msg287607 - (view)	Author: Armin Rigo (arigo) *	Date: 2017-02-11 15:51
That's not what the docs say. E.g.: https://docs.python.org/3/reference/datamodel.html#object.__hash__ says By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python. Morever, this command really prints changing numbers: ~/svn/python/3.7-debug/python -c "import datetime;print(hash(d atetime.datetime(2016,10,10,0,0,0,0)))"
msg287609 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2017-02-11 16:06
I only checked Python 2.7. For Python 3.x it's a bit more complicated: timedelta: PyObject_Hash(), always the same hash value date: _Py_HashBytes(), always a randomized hash value time: _Py_HashBytes() for offset = None, PyObject_Hash() for offset != 0 datetime: _Py_HashBytes() for offset = None, PyObject_Hash() for offset != 0 timezone: PyObject_Hash() (inherited from object) I don't know why the datetime module doesn't use hash randomization for datetime and time objects with an offset. MAL is the master of (date)time. He might know.
msg295165 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2017-06-05 04:58
Rather than changing the documentation, I would prefer to have this fixed for date/time/datetiem unless MAL has a reason not to make the change. For timezone, I don't think we really care. For timedelta, it is reasonable to always be the same hash. It isn't conceptually different from the float given by s.total_seconds() or a tuple of (days, seconds, microseconds).
msg295167 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-06-05 05:23
It looks to me that this difference is not intentional. This is just a consequence of the fact that __hash__ functions for some of these objects use the hash of the pickle state which is a bytes object.
msg349636 - (view)	Author: Ashwin Ramaswami (epicfaace) *	Date: 2019-08-14 03:30
I've added a PR which should fix this. Do you think the documentation should also be updated to change "By default, the :meth:`__hash__` values of str, bytes and datetime objects are "salted" with an unpredictable random value." to "By default, the :meth:`__hash__` values of str, bytes, datetime.date, datetime.time and datetime.datetime objects are "salted" with an unpredictable random value."? Technically, there are other objects in datetime such as datetime.relativedelta whose hash values are _not_ non-deterministic.
msg349653 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-08-14 06:45
Other option is to remove a note about datetime hash. It is an implementation detail. There are other objects with nondeterministic hash, for example tuples containing strings, but we should not document this explicitly or make the hash of all tuples nondeterministic.
msg349731 - (view)	Author: Ashwin Ramaswami (epicfaace) *	Date: 2019-08-14 18:07
Why is it ok for certain hashes (such as tuples) to be not non-deterministic, while other hashes (such as datetime) need to be non-deterministic?
msg349750 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2019-08-14 21:04
PEP 456 explains why hash of str and bytes must be randomized. I don't know any reason why hash of datetime objects must be randomized. They can be deterministic like floats and ints.
msg349780 - (view)	Author: Ashwin Ramaswami (epicfaace) *	Date: 2019-08-14 23:49
Randomizing the hash of datetime objects was first proposed in https://bugs.python.org/issue13703#msg151796. For the same reasons as str and bytes are non-deterministically hashed in in PEP 456, shouldn't numerics, datetime objects, and tuples be non-deterministically hashed as well? This is for the reason that they can all be used as dictionary keys (additionally, hash(n) begins to repeat when n is a large enough number) -- so it seems like they are also susceptible to the hash collision DoS attacks.
msg349801 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2019-08-15 10:50
> shouldn't numerics, datetime objects, and tuples be non-deterministically hashed as well? [...] Making the numeric hash non-predictable while maintaining its current properties would be difficult. But fortunately, I don't think it's necessary. IIUC, the original DOS attack involved carefully-crafted collections of keywords and values being passed to a website backend, with that backend then putting those keywords and values into a Python dictionary. I'd expect that there are way more places where a dict is being constructed with string keys in this way than with numeric keys. In fact, I think it's reasonable to assume that there are no websites vulnerable to a DOS via numeric hash collisions until we see evidence otherwise. FWIW, I'd expect the same to be true for datetime objects; I'm not sure why they were originally included. IANASE, but it seems to me that covering Unicode strings and bytestrings should be enough in practice.
msg349828 - (view)	Author: Ashwin Ramaswami (epicfaace) *	Date: 2019-08-15 21:19
> Making the numeric hash non-predictable while maintaining its current properties would be difficult. Why so? > In fact, I think it's reasonable to assume that there are no websites vulnerable to a DOS via numeric hash collisions until we see evidence otherwise. I'd expect that there are way more places where a dict is being constructed with string keys in this way than with numeric keys. That's true, but why do we restrict ourselves to websites? This is how I see it: As a Python developer, it seems like my program is immune to hash collision DoS if I use strings/bytes as dictionary keys, but not if my keys, say, are tuples of strings. Why not make the hash non-predictable for all builtin types by default?
msg349853 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2019-08-16 08:54
> Why so? Python's hash needs to obey the invariant that `hash(x) == hash(y)` for any two hashable objects with `x == y`. That makes life particularly hard for numeric types, because there are a number of different numeric types where equality can hold across those types. This includes not just built-in types, but third party types as well (think of NumPy, gmpy2, SymPy, and other libraries providing numbers that need to compare equal to Python numbers with the same value). So for example, `hash(1.5)`, `hash(Decimal("1.5"))`, `hash(Fraction(3, 2))`, `hash(1.5 + 0j)`, `hash(numpy.float32(1.5))`, `hash(bigfloat.BigFloat(1.5, precision=200))` must _all_ be equal to one another within a single running Python process. Moreover, hash computation needs to be efficient for common types like floats and integers, while also not being impossibly slow for other types. (Early versions of Decimal's hash did a check to see whether the Decimal instance was an exact integer, and if so, converted that Decimal instance to an integer before taking its hash. But doing that with `Decimal(1e999999)` doesn't go well.) It would definitely be possible to: - compute a hash in a cross-type-compatible way - do some sort of uniform post-processing of that hash, incorporating information from a per-process random salt The operations described by Melissa O'Neill in her PCG paper give ideas for some ways to do such post-processing: regard the hash and the salt as together forming a 128-bit integer, and then collapse that 128-integer down to a 64-bit integer using one of the PCG post-processing methods. Note that as far as I know there's been no validation of those methods from a cryptographic (rather than a statistical) perspective. However, it would be significant work, be disruptive not just to CPython, but to 3rd party packages and to other Python implementations, would slow down common hashing operations, and would increase the amount and the complexity of code that has to be maintained into the future. So there's no shortage of reasons not to change the numeric hash. What I think we're lacking is a single reason to change it. Can you give a plausible example of a situation where the predictability of the numeric hash can lead to possible security issues? See also the recent issue #37807. > but not if my keys, say, are tuples of strings Bad example. :-) The hash of a tuple is based on the hash of its contents. So if those contents are strings, the tuple benefits from the string hash randomization. mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))" -824966383135019564 mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))" -5971473524922642515 mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))" 5384650403450490974
msg349865 - (view)	Author: Tim Peters (tim.peters) *	Date: 2019-08-16 14:20
I'm with Mark: leave numeric hashes alone. There's no reason to change them, and in addition to what Mark wrote it's a positively Good Thing that `hash(i) == i` for all sufficiently small ints. Not only is that efficient to compute, it guarantees there are no collisions _at all_ in the common case of a dict indexed by a contiguous range of integers. The _purpose_ of `hash()` in Python isn't to create an illusion of randomness, it's to support efficient dicts and sets. Mucking with string hashes was a pragmatic hack to alleviate concerns about DOS attacks specific to string-indexed dicts. A better technical solution to that would have been to introduce a different flavor of dict with guaranteed good worst-case behaviors, but, pragmatically, a great many users would never realize that's what they really wanted, and it wouldn't have helped pre-existing code at all. But there's no reason to spread that hack beyond the use cases that needed it, and much reason not to.
msg350285 - (view)	Author: Ashwin Ramaswami (epicfaace) *	Date: 2019-08-23 13:54
Makes sense, thanks for the explanation. The risk is that if there is code that, say, converts a POST dictionary to a dictionary with numeric keys, that code could be exploited. Creating a non-deterministic hash doesn't necessarily preclude hash(x) = x for a small enough x either. Given that other libraries (NumPy, etc.) rely on the numeric hash staying the way it is, it makes sense to keep it as it is. Since when did something that seems at first glance to be an implementation detail become more like a backwards-incompatible API, though? (For example, the implementation of the numeric hash was changed without any backwards-compatibility issues in https://bugs.python.org/issue14621). Might there be a better way to clarify this distinction for other features in Python? I think the way forward for this patch is to keep the datetime hash as it is, and remove "datetime" in the parts of documentation that enumerate which data types have non-deterministic hashes.
msg350297 - (view)	Author: Ashwin Ramaswami (epicfaace) *	Date: 2019-08-23 14:18
Oh, that PR is already there in PR 15269, great!
msg350357 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-08-24 09:49
New changeset e9c90aa43144b0be1e4e393e8cb549573437a5da by Serhiy Storchaka in branch 'master': bpo-29535: Remove promize about hash randomization of datetime objects. (GH-15269) https://github.com/python/cpython/commit/e9c90aa43144b0be1e4e393e8cb549573437a5da
msg350361 - (view)	Author: miss-islington (miss-islington)	Date: 2019-08-24 10:19
New changeset 076d0b9f5def35aeb0f8e8aadf658dc35aace81d by Miss Islington (bot) in branch '3.8': bpo-29535: Remove promize about hash randomization of datetime objects. (GH-15269) https://github.com/python/cpython/commit/076d0b9f5def35aeb0f8e8aadf658dc35aace81d

History
Date	User	Action	Args
2022-04-11 14:58:43	admin	set	github: 73721
2019-08-24 10:29:35	serhiy.storchaka	set	status: open -> closed assignee: docs@python type: security -> behavior components: + Documentation versions: - Python 3.5, Python 3.6, Python 3.7 nosy: + docs@python resolution: fixed stage: patch review -> resolved
2019-08-24 10:19:53	miss-islington	set	nosy: + miss-islington messages: + msg350361
2019-08-24 09:49:43	miss-islington	set	pull_requests: + pull_request15148
2019-08-24 09:49:30	serhiy.storchaka	set	messages: + msg350357
2019-08-23 14:18:36	epicfaace	set	messages: + msg350297
2019-08-23 13:54:35	epicfaace	set	messages: + msg350285
2019-08-21 11:10:34	vstinner	set	nosy: + vstinner
2019-08-16 14:20:11	tim.peters	set	nosy: + tim.peters messages: + msg349865
2019-08-16 08:54:42	mark.dickinson	set	messages: + msg349853
2019-08-15 21:19:36	epicfaace	set	messages: + msg349828
2019-08-15 10:50:14	mark.dickinson	set	nosy: + mark.dickinson messages: + msg349801
2019-08-14 23:49:31	epicfaace	set	nosy: + dmalcolm messages: + msg349780
2019-08-14 21:04:49	christian.heimes	set	messages: + msg349750
2019-08-14 18:07:20	epicfaace	set	messages: + msg349731
2019-08-14 08:18:40	serhiy.storchaka	set	nosy: + benjamin.peterson
2019-08-14 08:18:11	serhiy.storchaka	set	pull_requests: + pull_request14991
2019-08-14 06:45:26	serhiy.storchaka	set	messages: + msg349653
2019-08-14 03:30:11	epicfaace	set	messages: + msg349636
2019-08-14 00:42:30	epicfaace	set	keywords: + patch stage: patch review pull_requests: + pull_request14983
2019-08-13 20:42:48	arigo	set	nosy: - arigo
2019-08-13 19:23:07	epicfaace	set	nosy: + epicfaace versions: + Python 3.8, Python 3.9
2017-06-05 05:23:09	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg295167
2017-06-05 04:58:10	rhettinger	set	nosy: + rhettinger messages: + msg295165
2017-06-04 06:30:00	arigo	set	pull_requests: + pull_request2016
2017-02-11 16:08:41	serhiy.storchaka	set	nosy: + belopolsky versions: + Python 3.6
2017-02-11 16:06:01	christian.heimes	set	nosy: + lemburg messages: + msg287609
2017-02-11 15:51:02	arigo	set	messages: + msg287607
2017-02-11 15:25:54	christian.heimes	set	nosy: + christian.heimes messages: + msg287602
2017-02-11 15:16:53	arigo	create