Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random.seed, whose purpose is purportedly determinism, behaves non-deterministically with strings due to hash randomization #71893

Closed
glyph mannequin opened this issue Aug 8, 2016 · 15 comments
Assignees
Labels
stdlib Python modules in the Lib dir tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@glyph
Copy link
Mannequin

glyph mannequin commented Aug 8, 2016

BPO 27706
Nosy @rhettinger, @terryjreedy, @vstinner, @glyph, @Lukasa, @Shredder13
Files
  • issue27706.patch
  • issue27706.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2016-08-31.22:07:06.003>
    created_at = <Date 2016-08-08.00:15:14.552>
    labels = ['tests', 'type-bug', 'library']
    title = 'Random.seed, whose purpose is purportedly determinism, behaves non-deterministically with strings due to hash randomization'
    updated_at = <Date 2016-08-31.22:07:06.002>
    user = 'https://github.com/glyph'

    bugs.python.org fields:

    activity = <Date 2016-08-31.22:07:06.002>
    actor = 'rhettinger'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2016-08-31.22:07:06.003>
    closer = 'rhettinger'
    components = ['Library (Lib)', 'Tests']
    creation = <Date 2016-08-08.00:15:14.552>
    creator = 'glyph'
    dependencies = []
    files = ['44308', '44309']
    hgrepos = []
    issue_num = 27706
    keywords = ['patch']
    message_count = 15.0
    messages = ['272137', '272141', '272418', '272712', '272928', '272972', '272973', '272987', '272993', '274066', '274067', '274069', '274074', '274076', '274077']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'terry.reedy', 'vstinner', 'glyph', 'python-dev', 'Lukasa', 'Nofar Schnider']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue27706'
    versions = ['Python 3.5', 'Python 3.6']

    @glyph
    Copy link
    Mannequin Author

    glyph mannequin commented Aug 8, 2016

    The purpose of 'seeding' a random number generator is usually to supply a deterministic sequence of values starting from a known point. This works fine if you seed random.Random with an integer. Often (for example, see Minecraft's map generation interface) one wants to begin with a human-memorable string as the seed, and superficially it would seem that passing a string to Random.seed would serve exactly this use-case. In fact in its original incarnation it did.

    However, since the introduction of PYTHONHASHSEED in 2.6.8, it's possible that strings now hash to different values, and on 3.2+, they'll _always_ hash to different values unless otherwise configured (which, as per the reason for introducing this feature in the first place, is a security flaw).

    Right now the way to work around this is to get some deterministic hash from your string; one mechanism being a truncated SHA256 hash, for example, like this:

    Random(struct.unpack("!I", sha256(seed.encode("utf-8")).digest()[:4])[0])

    but this strikes me as an obscure trick to require of someone just trying to get their D&D character generator to produce the same values twice in a row for unit testing.

    I'm not sure what the resolution should be, but I figured I should report this since I tripped over it.

    @glyph glyph mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Aug 8, 2016
    @rhettinger rhettinger self-assigned this Aug 8, 2016
    @rhettinger
    Copy link
    Contributor

    Thanks for the bug report. This is a case of something that used to work fine but was affected by an incidental change elsewhere.

    To support the use case for deterministic sequences of values starting from a known point, the docs promise, "If a new seeding method is added, then a backward compatible seeder will be offered. The generator’s random() method will continue to produce the same sequence when the compatible seeder is given the same seed." (See https://docs.python.org/3/library/random.html#notes-on-reproducibility )

    The resolution is to have the random module (line 327 in Modules/_randommodule.c) use a new _PyObject_Hash() function that deterministically matches what the old PyObject_Hash() function used to do.

    Marking this as "needs patch" and saving it for Nofar Schnider to work on (she's an aspiring core dev).

    @Shredder13
    Copy link
    Mannequin

    Shredder13 mannequin commented Aug 11, 2016

    On it!

    @terryjreedy
    Copy link
    Member

    I agree with the value of a Reproducibility guarantee. (The new section appears to have been added -- by Raymond -- in 3.2.) For instance, people posting play-through videos using reproducible random maps typically post the 'seed' and I have seen memorable phrases rather than ints used. I can imagine myself publishing seeds in other contexts.

    In 3.2, seed gained a 2nd parameter -- 'version'. "With version 2 (the default), a str, bytes, or bytearray object gets converted to an int and all of its bits are used." For this case, hashing is not an issue. But 'conversion to int' is. Did it change with the introduction of FSR in 3.3? It certainly should be frozen now, and the fact noted in the code.

    For other non-int objects, a hashable is required. (I expect anything other than int or string-like to be rare.) The doc does not say so (it should), but the dosctring does and experiment with [] confirms.

    "With version 1, the hash() of a is used instead."

    For hashed objects, whether version is 1 or 2, I guess the best we can do is to restore the fixed hash once used.

    For a fixed sequence of outputs, both seed and rng have to be fixed. 2.7 still has WichmannHill for this reason. It is gone in 3.x. It the rng is significantly changed (different sequence for the same seed), I believe the seed version should be changed also.

    @vstinner
    Copy link
    Member

    "Right now the way to work around this is to get some deterministic hash from your string; one mechanism being a truncated SHA256 hash, ..."

    It looks like I missed something. Lib/random.py already computes the SHA-512 hash of you pass a string to random.Random constructor?

    Using a string as a seed for random.Random already works as expected in Python 3.6:

    haypo@selma$ python3 -c "import random; r=random.Random('abc'); print(''.join(map(str, (r.randrange(10) for x in range(10)))), hash('abc'))"
    8755240310 6396067846301608395
    haypo@selma$ python3 -c "import random; r=random.Random('abc'); print(''.join(map(str, (r.randrange(10) for x in range(10)))), hash('abc'))"
    8755240310 -1771227904188177035
    haypo@selma$ python3 -c "import random; r=random.Random('abc'); print(''.join(map(str, (r.randrange(10) for x in range(10)))), hash('abc'))"
    8755240310 1726464324144904308
    haypo@selma$ python3 -c "import random; r=random.Random('abc'); print(''.join(map(str, (r.randrange(10) for x in range(10)))), hash('abc'))"
    8755240310 2069899884777593571
    haypo@selma$ python3 -c "import random; r=random.Random('abc'); print(''.join(map(str, (r.randrange(10) for x in range(10)))), hash('abc'))"
    8755240310 -8244933646981095152
    haypo@selma$ python3 -c "import random; r=random.Random('abc'); print(''.join(map(str, (r.randrange(10) for x in range(10)))), hash('abc'))"
    8755240310 -3269879388324739111

    It was already the case in Python 2.7.

    @glyph
    Copy link
    Mannequin Author

    glyph mannequin commented Aug 17, 2016

    It does seem to be stable on python 3, but on python 2.7 it's definitely a problem:

    $ python -Rc "import random; r=random.Random('abc'); print(''.join(map(str, (r.randrange(10) for x in range(10)))), hash('abc'))"
    ('9553343809', -1972659830997666042)
    $ python -Rc "import random; r=random.Random('abc'); print(''.join(map(str, (r.randrange(10) for x in range(10)))), hash('abc'))"
    ('5519010739', 5520208254012363023)
    $ python -Rc "import random; r=random.Random('abc'); print(''.join(map(str, (r.randrange(10) for x in range(10)))), hash('abc'))"
    ('7519888435', 3560222494758569319)
    $ python -Rc "import random; r=random.Random('abc'); print(''.join(map(str, (r.randrange(10) for x in range(10)))), hash('abc'))"
    ('9612648103', 4134882069837806740)

    @glyph
    Copy link
    Mannequin Author

    glyph mannequin commented Aug 17, 2016

    Changing the affected version to just 2.7.

    @vstinner
    Copy link
    Member

    Changing the affected version to just 2.7.

    Oh. The request looks like an enhancement. The problem is that if you add a new feature in Python 2.7.n+1, it's not available on Python 2.7.n. Support 2.7.n as well, you have to backport the code in your application.

    I'm not sure that it's worth to add such enhancement to the random at this point in Python 2.

    I suggest you to either upgrade to Python 3 (hello, Python 3!) or implement the SHA512 in your application. I expect that random.seed() in only called at one or maybe two places, so it shouldn't be hard to patch your code ;-)

    In short, I suggest to close the issue as wont fix.

    @glyph
    Copy link
    Mannequin Author

    glyph mannequin commented Aug 17, 2016

    For what it's worth, I don't much care whether this is fixed or not; I ended up wanting to leak less information from the RNG output anyway so I wrote this:

    https://gist.github.com/glyph/ceca96100a3049fefea6f2035abbd9ea

    but I felt like it should be reported.

    @Shredder13
    Copy link
    Mannequin

    Shredder13 mannequin commented Aug 31, 2016

    Adding the patch with seed fix for version=1 and tests (test_random).

    @Shredder13 Shredder13 mannequin added the tests Tests in the Lib/test dir label Aug 31, 2016
    @vstinner
    Copy link
    Member

    Nofar Schnider: "versions: + Python 3.5, - Python 2.7"

    I don't get it. This issue is specific to Python 2.7, no?

    bpo-27706.patch

    If you want to backport this feature from Python 3, I suggest to reuse the same code (so SHA 512). You might get the same random sequences on Python 2 and Python 3, but I don't think that it's matter :-) It's just that I expect that SHA-512 keeps more bits of entropy, than Python 2 hash function.

    @rhettinger
    Copy link
    Contributor

    For 2.7, we're going to add a note to the docs. For 3.5 and 3.6, we're adjusting the version==1 logic to meet the documented guarantee.

    @Shredder13
    Copy link
    Mannequin

    Shredder13 mannequin commented Aug 31, 2016

    fixed indentation

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 31, 2016

    New changeset 1f37903e6040 by Raymond Hettinger in branch '2.7':
    Issue bpo-27706: Document that random.seed() is non-deterministic when PYTHONHASHSEED is enabled
    https://hg.python.org/cpython/rev/1f37903e6040

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 31, 2016

    New changeset 5ae941fef3be by Raymond Hettinger in branch '3.5':
    Issue bpo-27706: Fix regression in random.seed(somestr, version=1)
    https://hg.python.org/cpython/rev/5ae941fef3be

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants