Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random.seed does not affect string hash randomization leading to non-intuitive results #84505

Closed
kwikwag mannequin opened this issue Apr 18, 2020 · 10 comments
Closed

Random.seed does not affect string hash randomization leading to non-intuitive results #84505

kwikwag mannequin opened this issue Apr 18, 2020 · 10 comments
Assignees
Labels
3.9 only security fixes docs Documentation in the Doc dir stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@kwikwag
Copy link
Mannequin

kwikwag mannequin commented Apr 18, 2020

BPO 40325
Nosy @tim-one, @rhettinger, @masklinn, @remilapeyre, @kwikwag
PRs
  • bpo-40325: Deprecate set object support in random.sample() #19591
  • Mention hash randomization in random lib bpo-40325 #19596
  • Files
  • test.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2020-04-19.07:37:40.308>
    created_at = <Date 2020-04-18.21:25:19.615>
    labels = ['type-bug', 'library', '3.9', 'docs']
    title = 'Random.seed does not affect string hash randomization leading to non-intuitive results'
    updated_at = <Date 2020-10-15.13:21:41.463>
    user = 'https://github.com/kwikwag'

    bugs.python.org fields:

    activity = <Date 2020-10-15.13:21:41.463>
    actor = 'xmorel'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2020-04-19.07:37:40.308>
    closer = 'rhettinger'
    components = ['Documentation', 'Library (Lib)']
    creation = <Date 2020-04-18.21:25:19.615>
    creator = 'Yuval S'
    dependencies = []
    files = ['49075']
    hgrepos = []
    issue_num = 40325
    keywords = ['patch']
    message_count = 10.0
    messages = ['366741', '366742', '366746', '366747', '366748', '366749', '366759', '366760', '366763', '378683']
    nosy_count = 6.0
    nosy_names = ['tim.peters', 'rhettinger', 'xmorel', 'docs@python', 'remi.lapeyre', 'Yuval S']
    pr_nums = ['19591', '19596']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue40325'
    versions = ['Python 3.9']

    @kwikwag
    Copy link
    Mannequin Author

    kwikwag mannequin commented Apr 18, 2020

    The following code gives different results on each run, even though "random.seed()" is used:

    >> import random
    >> random.seed(6)
    >> x = set(str(i) for i in range(500))
    >> print(random.sample(x, 1))

    presumably because of string hash randomization (see also bpo-27706),
    unless "PYTHONHASHSEED" is set.

    However, this is non-intuitive, especially as this random aspect of Python is not mentioned in Notes on Reproducability <https://docs.python.org/3/library/random.html#notes-on-reproducibility>_.

    I would suggest this is either fixed (using the provided seed for string hash randomization as well) or documented.

    @kwikwag kwikwag mannequin assigned docspython Apr 18, 2020
    @kwikwag kwikwag mannequin added 3.7 (EOL) end of life 3.8 only security fixes docs Documentation in the Doc dir stdlib Python modules in the Lib dir labels Apr 18, 2020
    @kwikwag kwikwag mannequin assigned docspython Apr 18, 2020
    @kwikwag kwikwag mannequin added 3.7 (EOL) end of life 3.8 only security fixes docs Documentation in the Doc dir stdlib Python modules in the Lib dir labels Apr 18, 2020
    @remilapeyre
    Copy link
    Mannequin

    remilapeyre mannequin commented Apr 18, 2020

    String hash randomization is a security feature so it may be better to not disable it unless explicitly asked for. Maybe a note in random's documentation could be added?

    @rhettinger
    Copy link
    Contributor

    I'm going to deprecate the support for sets. It was a design mistake at several levels. Better to just remove it.

    @rhettinger rhettinger added 3.9 only security fixes and removed 3.7 (EOL) end of life 3.8 only security fixes labels Apr 19, 2020
    @rhettinger rhettinger assigned rhettinger and unassigned docspython Apr 19, 2020
    @rhettinger rhettinger added type-bug An unexpected behavior, bug, or error 3.9 only security fixes and removed 3.7 (EOL) end of life 3.8 only security fixes labels Apr 19, 2020
    @rhettinger rhettinger assigned rhettinger and unassigned docspython Apr 19, 2020
    @rhettinger rhettinger added the type-bug An unexpected behavior, bug, or error label Apr 19, 2020
    @tim-one
    Copy link
    Member

    tim-one commented Apr 19, 2020

    Raymond, I think that removing sample(set) support is a different issue. This report will just change its final example line to

    >> print(random.sample(list(x), 1))

    or

    >> print(random.sample(tuple(x), 1))

    and have the same complaint.

    @rhettinger
    Copy link
    Contributor

    I think the thing we can fix is the automatic set support which is intrinsically broken with respect to reproducibility and which was likely not a good idea to begin with (because it adds an implicit and possibly unexpected O(n) conversion step and because it doesn't make the API for choice()).

    If someone converts a set to a list or tuple upstream from sample(), there isn't much we can do about it. That wouldn't be much different from list(s)[0] giving different output from run to run. That is a general FAQ and would apply to just about anything that takes a sequence or iterator to run.

    @tim-one
    Copy link
    Member

    tim-one commented Apr 19, 2020

    Yup, I agree sample(set) is a misfeature.

    @rhettinger
    Copy link
    Contributor

    New changeset 4fe0020 by Raymond Hettinger in branch 'master':
    bpo-40325: Deprecate set object support in random.sample() (GH-19591)
    4fe0020

    @rhettinger
    Copy link
    Contributor

    Yuval, thanks for the report.

    @kwikwag
    Copy link
    Mannequin Author

    kwikwag mannequin commented Apr 19, 2020

    Thank you for the attention and the quick fix. However, the current documentation for "Notes on Reproducibility" should still address this issue of hash randomization. Not only sample is affected by this, but any code that combines strings (or bytes or datetime) with hash and random, e.g.

    >> import random
    >> random.seed(6)
    >> a = list(set(str(i) for i in range(500)))
    >> print(a[int(random.random() * 500)])

    or, this

    >> import random
    >> import datetime
    >> random.seed(6)
    >> print(random.choice(range(hash(datetime.datetime(2000,1,1)) % 100)))

    will still produce non-reproducible results even after the fix. Here is my suggestion for documentation:

    Hash randomization, which is enabled by default since version 3.3, is not affected by random.seed(). For this reason, code that relies on string hashes, such as code that relies on the ordering of set or dict, might be non-reproducible, unless string hash randomization is disabled or seeded (see: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED).

    My vote would be to keep hash randomization ties to random.seed(), and this would make all use cases more predictable, as well as allow random.sample() to support set.

    @masklinn
    Copy link
    Mannequin

    masklinn mannequin commented Oct 15, 2020

    @rhettinger checking software against 3.9 there's a little issue with the way the check is done: if passed something which is both a sequence and a set (e.g. an ordered set), random.sample will trigger a warning, which I don't think is correct.

    Should I open a new issue for that? Fix seems simple: just move the check for _Set inside the check for _Sequence, and raise if that doesn't pass either.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes docs Documentation in the Doc dir stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants