Make hash(None) consistent among processes #63423

Closed
hongqn mannequin opened this issue Oct 11, 2013 · 10 comments
Labels
type-feature A feature request or enhancement

Comments

@hongqn
Mannequin

hongqn mannequin commented Oct 11, 2013

BPO 19224
Nosy @tim-one, @rhettinger, @pitrou, @vstinner, @tiran
Files
  • hash_of_none.patch (deprecated): make hash(None) always return 0
  • hash_of_none.patch: make hash(None) always return 1315925605
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2013-10-12.17:11:43.666>
    created_at = <Date 2013-10-11.04:03:40.946>
    labels = ['type-feature']
    title = 'Make hash(None) consistent among processes'
    updated_at = <Date 2013-10-12.17:18:48.989>
    user = 'https://bugs.python.org/hongqn'

    bugs.python.org fields:

    activity = <Date 2013-10-12.17:18:48.989>
    actor = 'christian.heimes'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-10-12.17:11:43.666>
    closer = 'rhettinger'
    components = []
    creation = <Date 2013-10-11.04:03:40.946>
    creator = 'hongqn'
    dependencies = []
    files = ['32043', '32044']
    hgrepos = []
    issue_num = 19224
    keywords = ['patch']
    message_count = 10.0
    messages = ['199439', '199440', '199441', '199442', '199448', '199452', '199453', '199538', '199603', '199606']
    nosy_count = 6.0
    nosy_names = ['tim.peters', 'rhettinger', 'pitrou', 'vstinner', 'christian.heimes', 'hongqn']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue19224'
    versions = ['Python 3.3', 'Python 3.4']

    @hongqn
    Mannequin Author

    hongqn mannequin commented Oct 11, 2013

    The hashes of integers, strings, and bools are all consistent across processes of the same interpreter. However, hash(None) differs.

    $ python -c "print(hash(None))"
    272931276
    $ python -c "print(hash(None))"
    277161420

    This is weird, and it makes things difficult for distributed systems that partition data according to the hash of keys when the system wants keys to support None.

    This patch makes hash(None) always return 0 to resolve that problem. It is used in DPark (a Python clone of Spark, a MapReduce-like framework, https://github.com/douban/dpark) to speed up its portable hash (see https://github.com/douban/dpark/blob/65a3ba857f11285667c61e2e134dacda44c13a2c/dpark/util.py#L47).

    davies.liu@gmail.com is the original author of this patch. All credit goes to him.
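
The hash-based partitioning described in this comment can be sketched roughly as follows. This is a minimal illustration, not DPark's actual code: `partition_for` and `num_partitions` are hypothetical names, and the `None` special case is an application-level workaround that mimics what the proposed patch would do inside the interpreter.

```python
def partition_for(key, num_partitions):
    """Pick a partition index for `key` (hypothetical helper).

    Assumption: every worker must map the same key to the same index.
    hash(None) is not guaranteed to be stable across processes, so None
    is special-cased here instead of relying on the built-in hash.
    """
    h = 0 if key is None else hash(key)
    # Python's % always yields a non-negative result for a positive modulus.
    return h % num_partitions


# None keys always land in the same partition, regardless of the process.
print(partition_for(None, 8))
# Caveat: str keys are still subject to hash randomization, so this sketch
# is only stable across processes for None and integer keys.
print(partition_for(10, 8))
```

The workaround keeps the fix in application code, so it does not depend on any guarantee from the interpreter about `hash(None)`.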

    @rhettinger
    Contributor

    Instead of 0, pick some large random number that is less likely to collide with other hashes such as hash(0).

    @tiran
    Member

    tiran commented Oct 11, 2013

    How about

    >>> (78 << 24) + (111 << 16) + (110 << 8) + 101
    1315925605
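
The constant is not arbitrary: 78, 111, 110 and 101 are the ASCII codes of the characters N, o, n and e, so the expression packs the bytes of "None" into a big-endian 32-bit integer. A quick check, using only the standard `int.to_bytes` and `int.from_bytes` methods:

```python
# Rebuild the proposed constant from the shifted ASCII codes.
packed = (78 << 24) + (111 << 16) + (110 << 8) + 101

# Unpack it into 4 bytes, most significant byte first.
as_bytes = packed.to_bytes(4, "big")
print(as_bytes)

# The reverse direction gives the same integer.
print(int.from_bytes(b"None", "big") == packed)
```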

    The output of hash() is not guaranteed to be consistent between processes. The outcome depends on the hash randomization key, architecture, platform, Python version and perhaps other flags. 32-bit builds of Python generate different hash() values than 64-bit builds. The value might depend on endianness, too. (Not sure about that.)

    @tiran added the type-feature label on Oct 11, 2013
    @hongqn
    Mannequin Author

    hongqn mannequin commented Oct 11, 2013

    The patch now returns 1315925605 :)

    @pitrou
    Member

    pitrou commented Oct 11, 2013

    Is this something we actually want to support officially? Many other types have non-repeatable hashes, e.g.:

    $ PYTHONHASHSEED=1 python3 -c "print(hash((lambda: 0)))"
    8771754605115
    $ PYTHONHASHSEED=1 python3 -c "print(hash((lambda: 0)))"
    8791504743739
    $ PYTHONHASHSEED=1 python3 -c "print(hash((lambda: 0)))"
    8788491320379
    $ PYTHONHASHSEED=1 python3 -c "print(hash((lambda: 0)))"
    8792628055611

    @vstinner
    Member

    In the same Python version, hash(None) always gives me the same value. I cannot reproduce your issue on Linux; I tested Python 2.7, 3.3 and 3.4.

    $ python2.7 -c "print(hash(None))"
    17171842026
    $ python2.7 -c "print(hash(None))"
    17171842026
    $ python2.7 -c "print(hash(None))"
    17171842026
    
    $ python3.3 -c "print(hash(None))"
    17171873465
    $ python3.3 -c "print(hash(None))"
    17171873465
    $ python3.3 -c "print(hash(None))"
    17171873465
    
    $ python3.4 -c "print(hash(None))"
    588812
    $ python3.4 -c "print(hash(None))"
    588812
    $ python3.4 -c "print(hash(None))"
    588812

    @vstinner
    Member

    "This is weird, and it makes things difficult for distributed systems that partition data according to the hash of keys when the system wants keys to support None."

    How do you handle the randomization of hash(str)? (It is enabled with python2.7 -R, and by default in Python 3.3.)
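
The effect of string hash randomization can be observed by computing the same hash in fresh interpreter processes. The sketch below is illustrative only: `str_hash_in_child` is a hypothetical helper, and it assumes a Python 3.7+ interpreter that honors the `PYTHONHASHSEED` environment variable.

```python
import os
import subprocess
import sys


def str_hash_in_child(text, seed):
    """Run a fresh interpreter and return hash(text) as computed there.

    `seed` is passed through PYTHONHASHSEED: an integer fixes the seed,
    while the string "random" requests a new random seed per process.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# With a fixed seed, two separate processes agree on hash("abc").
print(str_hash_in_child("abc", 1) == str_hash_in_child("abc", 1))

# With a random seed, the two values will almost certainly differ.
print(str_hash_in_child("abc", "random"), str_hash_in_child("abc", "random"))
```

Pinning `PYTHONHASHSEED` only makes runs repeatable for one interpreter build; it does not make hashes comparable across Python versions or platforms.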

    @tim-one
    Member

    tim-one commented Oct 12, 2013

    -0.

    Since hash(None) is currently based on None's memory address, I appreciate that it's not reliable (e.g., use different releases of the same compiler to build Python, and hash(None) may be different between them).

    The docs guarantee little about hash() results, so applications relying on cross-machine - or even same-machine cross-run - consistency are broken.

    It's trivial code bloat to special-case None, but it leaves a world of other hash() behaviors as-is (essentially "undefined"). The portable_hash() function in the DPark source is a start at what needs to be done if an application wants reliable hashes. But it's just a start (e.g., it's apparently relying on cross-platform consistency for hash(integer) and hash(string), etc).

    Since CPython will never promise to make all of those consistent across platforms and releases, I'd rather not even start down that road. Making the promise for hash(None) would be an attractive nuisance.
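
An application that needs reliable hashes, as discussed in this comment, can avoid the built-in hash() entirely and derive its own value from a canonical byte encoding. The following is a minimal sketch under stated assumptions, not DPark's portable_hash(): the names `stable_hash` and `_encode` are hypothetical, SHA-256 is one arbitrary choice of digest, and only a handful of types are supported.

```python
import hashlib
import struct


def _encode(value):
    """Turn a value into bytes that do not depend on process state."""
    # Each branch adds a type tag so that, e.g., 1 and "1" encode differently.
    if value is None:
        return b"N"
    if isinstance(value, bool):  # must precede int: bool subclasses int
        return b"B1" if value else b"B0"
    if isinstance(value, int):
        return b"I" + str(value).encode("ascii")
    if isinstance(value, str):
        return b"S" + value.encode("utf-8")
    if isinstance(value, bytes):
        return b"Y" + value
    if isinstance(value, tuple):
        parts = [_encode(item) for item in value]
        # Length-prefix each element so element boundaries are unambiguous.
        return b"T" + b"".join(struct.pack(">I", len(p)) + p for p in parts)
    raise TypeError(f"unsupported key type: {type(value).__name__}")


def stable_hash(value):
    """Return a 64-bit integer derived from the first 8 bytes of SHA-256."""
    digest = hashlib.sha256(_encode(value)).digest()
    return int.from_bytes(digest[:8], "big")


# The result depends only on the value, not on memory addresses or hash seeds.
print(stable_hash(None))
print(stable_hash(("user", 42, None)))
```

Unlike hash(), this sketch deliberately gives `True` and `1` different values, so it is suitable for partitioning keys but not as a drop-in replacement for `__hash__`.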

    @rhettinger
    Contributor

    There seems to be a pretty good consensus that this is something we don't want to support.

    @tiran
    Member

    tiran commented Oct 12, 2013

    Tim has convinced me, too.

    @ezio-melotti transferred this issue from another repository on Apr 10, 2022