classification
Title: random.seed(tuple) uses the randomized hash function and so is not reproductible
Type: behavior Stage: resolved
Components: Documentation, Extension Modules Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: docs@python, johnnyd, mark.dickinson, poddster, rhettinger, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2018-01-15 09:08 by johnnyd, last changed 2019-08-22 16:20 by rhettinger. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 15382 merged rhettinger, 2019-08-22 09:03
Messages (9)
msg309956 - (view) Author: Johnny Dude (johnnyd) Date: 2018-01-15 09:08
When seeding with a tuple that includes a string, the results are not consistent across new interpreter invocations or processes.

For example, executing the following on a Linux machine will yield a different result on each run:
python3.6 -c 'import random; random.seed(("a", 1)); print(random.random())'

Please note that the doc string of random.seed states: "Initialize internal state from hashable object."

The Python documentation does not mention this. (https://docs.python.org/3.6/library/random.html#random.seed)

This is very confusing, I hope you can fix the behavior, not the doc string.
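
A minimal sketch of the underlying cause, not part of the original report: it prints hash(("a", 1)) from fresh interpreter processes while controlling the PYTHONHASHSEED environment variable. The helper name hash_in_fresh_interpreter is hypothetical. It calls hash() directly rather than random.seed(), because how random.seed() treats a tuple depends on the Python version.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(hash_seed: str) -> str:
    """Return hash(("a", 1)) as printed by a newly started interpreter.

    hash_seed is passed through PYTHONHASHSEED ("random" or a decimal integer).
    """
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", 'print(hash(("a", 1)))'],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # With hash randomization (the default), each process usually differs.
    print(hash_in_fresh_interpreter("random"))
    print(hash_in_fresh_interpreter("random"))
    # Pinning PYTHONHASHSEED makes the tuple's hash repeatable across processes.
    print(hash_in_fresh_interpreter("0"))
    print(hash_in_fresh_interpreter("0"))
```

Pinning PYTHONHASHSEED is a workaround for the whole process, not a fix for the seeding behavior discussed in this issue.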
msg309957 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-01-15 09:13
random.seed(str) uses:

        if version == 2 and isinstance(a, (str, bytes, bytearray)):
            if isinstance(a, str):
                a = a.encode()
            a += _sha512(a).digest()
            a = int.from_bytes(a, 'big')

Whereas for other types, random.seed(obj) uses hash(obj), and hash is randomized by default in Python 3.

Yeah, the random.seed() documentation should describe the implementation and explain that hash(obj) is used and that the hash function is randomized by default:
https://docs.python.org/dev/library/random.html#random.seed
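
For illustration only, the str/bytes branch quoted above can be reproduced outside the random module. The function name text_seed_to_int is hypothetical; the body follows the quoted snippet, and the comparison in the usage lines assumes a CPython version whose version-2 seeding uses that conversion.

```python
import random
from hashlib import sha512


def text_seed_to_int(a):
    """Convert a str, bytes, or bytearray seed to an int, as in the quoted snippet."""
    if isinstance(a, str):
        a = a.encode()
    # Append the SHA-512 digest, then read the whole byte string as a big-endian int.
    a = bytes(a) + sha512(a).digest()
    return int.from_bytes(a, "big")


# The conversion never calls hash(), so it does not depend on PYTHONHASHSEED.
seed_value = text_seed_to_int("a")
same_stream = random.Random(seed_value).random() == random.Random("a").random()
print(same_stream)
```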
msg310009 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-01-15 18:46
> This is very confusing, I hope you can fix the behavior, not the doc string.

I'll fix the docstring to make it more specific.

We really don't want to use hash(obj) because it produces too few bits of entropy.
msg310019 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-01-15 21:49
Maybe deprecate using a hash?
msg320360 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-06-24 06:59
> Maybe deprecate using a hash?

Any deprecation will likely break some existing code, but it would be nice to restrict input types to int, float, bytes, bytearray, or str.  Then we could remove all references to hashing.
msg320361 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-24 07:08
This is what I meant. Emit a deprecation warning for input types other than the explicitly supported types (but I didn't think about float), and raise an error in the future.
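
If seeding is restricted to the explicitly supported types, callers that seed with a tuple would need to convert it themselves. One possible approach, shown here as a sketch and not something proposed in this thread, is to hash a stable text form of the object. The name stable_seed is hypothetical, and the approach assumes repr(obj) is identical across runs, which holds for tuples of str and int but not for sets of strings or objects whose repr contains a memory address.

```python
import hashlib
import random


def stable_seed(obj) -> int:
    """Derive an int seed from repr(obj) without using the randomized hash()."""
    data = repr(obj).encode("utf-8")
    return int.from_bytes(hashlib.sha512(data).digest(), "big")


# int is a supported seed type, so this is reproducible across processes.
rng = random.Random(stable_seed(("a", 1)))
print(rng.random())
```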
msg320383 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-06-24 19:44
I'm thinking of something like this:

$ git diff
diff --git a/Lib/random.py b/Lib/random.py
index 1e0dcc87ed..f479e66ada 100644
--- a/Lib/random.py
+++ b/Lib/random.py
@@ -136,12 +136,17 @@ class Random(_random.Random):
             x ^= len(a)
             a = -2 if x == -1 else x

-        if version == 2 and isinstance(a, (str, bytes, bytearray)):
+        elif version == 2 and isinstance(a, (str, bytes, bytearray)):
             if isinstance(a, str):
                 a = a.encode()
             a += _sha512(a).digest()
             a = int.from_bytes(a, 'big')

+        elif not isinstance(a, (type(None), int, float, str, bytes, bytearray)):
+            _warn('Seeding based on hashing is deprecated.\n'
+                  'The only supported seed types are None, int, float, '
+                  'str, bytes, and bytearray.', DeprecationWarning, 2)
+
         super().seed(a)
         self.gauss_next = None
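
The type check in the diff above can be tried in isolation. This is a standalone sketch of the proposed branch only, not the code that was eventually committed; the names SUPPORTED_SEED_TYPES and warn_if_hash_seeded are hypothetical.

```python
import warnings

# Seed types named in the proposed warning message.
SUPPORTED_SEED_TYPES = (type(None), int, float, str, bytes, bytearray)


def warn_if_hash_seeded(a) -> bool:
    """Warn and return True when seeding with `a` would fall back to hash(a)."""
    if not isinstance(a, SUPPORTED_SEED_TYPES):
        warnings.warn(
            "Seeding based on hashing is deprecated.\n"
            "The only supported seed types are None, int, float, "
            "str, bytes, and bytearray.",
            DeprecationWarning,
            stacklevel=2,
        )
        return True
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_hash_seeded(("a", 1))  # a tuple is not a supported seed type
    warn_if_hash_seeded("a")       # str is supported, so no warning
print(len(caught))
```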
msg321759 - (view) Author: Lee Griffiths (poddster) Date: 2018-07-16 19:25
a) This below issue added doc to py2.7 that calls out PYTHONHASHSEED, but py doesn't currently contain those words

https://bugs.python.org/issue27706

It'd be useful to have the same thing whether the "behaviour" is fixed or not, as providing other objects (like a tuple) will still be non-deterministic.

b) I don't know if this is the correct issue to heap this on, but I think it might be, as you're looking at changing the seed function?

The documentation for `object.__hash__` calls out `str`, `bytes` and `datetime` as being affected by `PYTHONHASHSEED`. Doesn't it seem odd that there's a workaround in the seed function for str and bytes, but not for datetime?

https://docs.python.org/3/reference/datamodel.html#object.__hash__

I mainly point this out as seeding random with the current date/time is idiomatic in many languages and environments (usually used when you log the seed to be able to recreate things later, or just blindly copying the historical use `srand(time(NULL))` from C programs!). Anyone shoving a datetime straight into seed() is going to find it non-deterministic and might not understand why, or even notice, especially as the documentation for seed() doesn't call this out. 

Those "in the know" will get a unix timestamp out of the datetime and put that in seed instead, but I feel that falls under the same argument as users-in-the-know SHA512ing a string, mentioned above, which is undesirable and apparently something the function should implement and not users.

Would it be wise for datetime to have a specific implementation as well?
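
The "unix timestamp" workaround mentioned above might look like the following sketch. The helper name seed_from_datetime is hypothetical. It uses a timezone-aware datetime on purpose, because datetime.timestamp() interprets a naive datetime as local time, which would make the seed depend on the machine's timezone.

```python
import random
from datetime import datetime, timezone


def seed_from_datetime(moment: datetime) -> int:
    """Turn a datetime into an int seed (whole seconds since the Unix epoch)."""
    return int(moment.timestamp())


moment = datetime(2018, 7, 16, 19, 25, tzinfo=timezone.utc)
seed = seed_from_datetime(moment)

# Log the seed so the run can be recreated later, then seed with the int.
print("seed:", seed)
rng = random.Random(seed)
print(rng.random())
```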
msg350209 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-22 16:19
New changeset d0cdeaab76fef8a6e5a04665df226b6659111e4e by Raymond Hettinger in branch 'master':
bpo-32554: Deprecate hashing arbitrary types in random.seed() (GH-15382)
https://github.com/python/cpython/commit/d0cdeaab76fef8a6e5a04665df226b6659111e4e
History
Date User Action Args
2019-08-22 16:20:08 rhettinger set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2019-08-22 16:19:39 rhettinger set messages: + msg350209
2019-08-22 09:03:27 rhettinger set keywords: + patch
stage: patch review
pull_requests: + pull_request15092
2018-07-16 19:25:12 poddster set nosy: + poddster
messages: + msg321759
2018-06-24 19:44:55 rhettinger set messages: + msg320383
2018-06-24 07:08:15 serhiy.storchaka set messages: + msg320361
2018-06-24 06:59:04 rhettinger set messages: + msg320360
2018-01-15 21:49:39 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg310019
2018-01-15 18:46:17 rhettinger set messages: + msg310009
2018-01-15 18:45:16 rhettinger set messages: - msg310006
2018-01-15 18:41:42 rhettinger set messages: + msg310006
2018-01-15 18:19:48 rhettinger set assignee: docs@python -> rhettinger
2018-01-15 09:13:51 vstinner set title: random seed is not consistent when using tuples with a str element -> random.seed(tuple) uses the randomized hash function and so is not reproductible
2018-01-15 09:13:25 vstinner set nosy: + rhettinger, docs@python, vstinner, mark.dickinson
messages: + msg309957

assignee: docs@python
components: + Documentation
2018-01-15 09:08:22 johnnyd create