This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Make hash() return a non-negative number
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.9
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: mark.dickinson, njs, rhettinger, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch

Created on 2019-08-10 06:24 by rhettinger, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 15199 closed serhiy.storchaka, 2019-08-10 15:36
Messages (9)
msg349333 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-10 06:24
The existing API for the builtin hash() function is inconvenient.  When implementing structures that use a hash value, you never want a negative number.  Running abs(hash(x)) works but is awkward, slower, and loses one bit of entropy.
msg349341 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-08-10 15:40
The only problem is that the current algorithm for numeric types is documented to return negative numbers.
https://docs.python.org/3/library/stdtypes.html#hashing-of-numeric-types
msg349342 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-10 16:36
We can clarify that those docs describe int.__hash__, float.__hash__, and complex.__hash__.

The builtin hash() function can and does make transformations on the result from the __dunder__ method:

    >>> class A:
    ...     def __hash__(self):
    ...         return 123423412341234123412341234

    >>> hash(A())
    1656462513496965462
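The reduction that hash() applies to the __dunder__ result can be checked directly. This is a sketch (the class names WideHash and MinusOne are made up for illustration) relying only on documented CPython behavior: the result is wrapped to the width of Py_hash_t, and -1 is reserved as a C-level error code, so it is replaced by -2:

```python
import sys

class WideHash:
    def __hash__(self):
        # Deliberately return more than a Py_hash_t can hold.
        return 123423412341234123412341234

class MinusOne:
    def __hash__(self):
        return -1

w = sys.hash_info.width  # 64 on 64-bit builds
# hash() wraps the __hash__ result into the signed Py_hash_t range...
assert -2**(w - 1) <= hash(WideHash()) < 2**(w - 1)
# ...and substitutes -2 for the reserved error value -1.
assert hash(MinusOne()) == -2
```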
msg349347 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-08-10 17:19
This is a change of documented behaviour that will affect 3rd party libraries that provide numeric types, like NumPy and gmpy; it should not be done lightly. I think such a change deserves at least a deprecation warning, and should also be discussed more widely than just this issue.
msg349349 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-08-10 18:32
Currently the hash for integers and several other types uses 62 bits of 64 (on 64-bit platform). The domain of values is continuous:

    -sys.hash_info.modulus < h < sys.hash_info.modulus

If we interpret Py_hash_t as Py_uhash_t, it will no longer be continuous:

    0 <= h < sys.hash_info.modulus or 2**sys.hash_info.width-sys.hash_info.modulus < h < 2**sys.hash_info.width
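The split into two disjoint ranges is easy to see in Python; `as_unsigned` below is a hypothetical helper that mimics reinterpreting a signed Py_hash_t as Py_uhash_t (two's complement):

```python
import sys

width = sys.hash_info.width      # 64 on 64-bit builds
modulus = sys.hash_info.modulus  # 2**61 - 1 on 64-bit builds

def as_unsigned(h):
    # Reinterpret a signed Py_hash_t as a Py_uhash_t (two's complement).
    return h % 2**width

# Signed hashes fill -modulus < h < modulus contiguously; reinterpreted
# as unsigned, the negatives jump to the top of the unsigned range.
assert as_unsigned(5) == 5
assert as_unsigned(-5) == 2**width - 5
assert as_unsigned(-5) > 2**width - modulus
```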

The hash for tuples is written in terms of the signed hashes of its elements. If it were written in Python, it would look more complicated with unsigned hashes.

What are the problems with using negative hash values? In hash tables you squeeze the hash into the valid range by evaluating `h % size` or `h & mask`. Both expressions give a non-negative result.
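A quick sketch of this point: both reductions map any signed hash into [0, size), regardless of sign:

```python
# Table size must be positive; `mask` assumes size is a power of two.
size = 8
mask = size - 1

for h in (-7, -1, 0, 1, hash("spam")):
    index_mod = h % size   # Python's % takes the sign of the modulus
    index_and = h & mask   # & with a non-negative mask is non-negative
    assert 0 <= index_mod < size
    assert 0 <= index_and < size
```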
msg349350 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-08-10 18:33
Well, I have no code that would benefit from this change.  What's the point?  Sure, I use _parts_ of hash codes at times, but, e.g.,

    index = the_hash_code & SOME_LOW_BIT_MASK
    
in Python couldn't care less about the sign of `the_hash_code` - it returns a non-negative int regardless.  The same applies if, e.g., SOME_LOW_BIT_MASK is `(1 << sys.hash_info.hash_bits) - 1`, giving a full-width non-negative hash without losing a bit of "entropy".  But I've never felt a need for the latter.  Likewise for doing, e.g.,

    index = the_hash_code % SOME_POSITIVE_PRIME

in Python.  % takes the sign (positive here) of the modulus in Python.

So this _looks_ like pointlessly disruptive code churn to me.  What killer use case am I missing?
msg349419 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-11 23:57
> What killer use case am I missing?

There's no killer use case. I just thought it weird that the builtin hash() function ever returned a negative number.

> This is a change of documented behaviour that will affect 
> 3rd party libraries that provide numeric types, like NumPy
> and gmpy; it should not be done lightly.

Okay, I'm happy to drop this.  It was only a minor nuisance.

I'm not sure at what point we ever guaranteed that hash() would return negative numbers, or any particular number.  That was probably a mistake.  I thought the guarantees were all higher level metamorphic invariants such as hash(x) == hash(y) when x == y and {type(x), type(y)} ⊂ {bool, int, float, Decimal, Fraction, complex}.
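That cross-type invariant can be spot-checked directly (True stands in for bool here):

```python
from decimal import Decimal
from fractions import Fraction

# Equal numeric values must hash equal, whatever the type.
assert True == 1 == 1.0 == Fraction(1) == Decimal(1)
assert hash(True) == hash(1) == hash(1.0) == hash(Fraction(1)) == hash(Decimal(1))

# complex participates too, as long as the imaginary part is zero.
assert 1 == complex(1, 0) and hash(1) == hash(complex(1, 0))
```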

BTW, I'm not sure why the numeric hashes are not getting all 64 bits into play:

    from random import randrange, expovariate
    from secrets import token_hex, token_bytes

    for random_hash in [
        lambda : hash(randrange(2**100)) % 2**64,
        lambda : abs(hash(expovariate(1.0))) % 2**64,
        lambda : hash(token_hex()) % 2**64,
        lambda : hash(token_bytes()) % 2**64,
    ]:
        print(max(random_hash().bit_length() for i in range(10_000)))

Outputs:

    61
    61
    64
    64
msg349422 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-08-12 00:40
I agree:  we "shouldn't have" documented anything about hash codes beyond the invariants needed to guarantee they work for their intended purpose, chiefly that x == y implies hash(x) == hash(y).

Which leads to your other question ;-)  That invariant is tricky to maintain across a pile of distinct but comparable numeric types!  Whatever we return for an int needs also to be what's returned for a float that happens to have the same value, and whatever we return for a float needs also to be what's returned for a fraction.Fraction with the same value, and so on.

So Python uses a single numeric hashing algorithm defined on mathematical rationals, described here:

https://docs.python.org/3.8/library/stdtypes.html#hashing-of-numeric-types

Presumably that's documented so that people defining their own numeric types have a clue about how to implement compatible hash codes.

Anyway, that algorithm uses a large fixed prime P (different on 32- and 64-bit boxes), and uses operations modulo P.  That's why the full bit width isn't used.  And not just any prime P, it picks one for which computing the remainder mod P is especially efficient.  2**61 - 1 is as good as it gets on 64 bit boxes.
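The scheme can be written out in a few lines; this is essentially the reference implementation given in the stdtypes docs for the rational case, reproduced here as a sketch:

```python
import sys
from fractions import Fraction

def hash_fraction(m, n):
    """Hash the rational number m / n, n > 0, per the documented scheme."""
    P = sys.hash_info.modulus           # 2**61 - 1 on 64-bit builds
    # Remove common factors of P (unnecessary if m and n are coprime).
    while m % P == n % P == 0:
        m, n = m // P, n // P
    if n % P == 0:
        hash_value = sys.hash_info.inf  # no inverse of n modulo P exists
    else:
        # Fermat's little theorem: pow(n, P-2, P) is the inverse of n mod P.
        hash_value = (abs(m) % P) * pow(n, P - 2, P) % P
    if m < 0:
        hash_value = -hash_value
    if hash_value == -1:
        hash_value = -2
    return hash_value

assert hash_fraction(7, 3) == hash(Fraction(7, 3))
assert hash_fraction(1, 2) == hash(0.5)
assert hash_fraction(-5, 1) == hash(-5)
```

Because every value is reduced modulo P = 2**61 - 1, a result can never exceed 61 bits, which is exactly what Raymond's experiment observed for the numeric inputs.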

I didn't pick this algorithm (I suspect Mark did), but I certainly approve of it.  It's clever and general enough to apply to any plausible subset-of-real numeric type short of the constructive reals (where equality isn't even theoretically decidable, so "x == y implies ..." can't get off the ground ;-) ).
msg349424 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-12 00:54
Withdrawing this issue and marking as closed.

Thanks for your thoughts on the subject.
History
Date                 User              Action  Args
2022-04-11 14:59:18  admin             set     github: 81988
2019-08-12 00:54:24  rhettinger        set     status: open -> closed; resolution: rejected; messages: + msg349424; stage: patch review -> resolved
2019-08-12 00:40:40  tim.peters        set     messages: + msg349422
2019-08-11 23:57:49  rhettinger        set     messages: + msg349419
2019-08-10 18:33:42  tim.peters        set     messages: + msg349350
2019-08-10 18:32:40  serhiy.storchaka  set     messages: + msg349349
2019-08-10 17:19:15  mark.dickinson    set     nosy: + njs; messages: + msg349347
2019-08-10 16:36:05  rhettinger        set     messages: + msg349342
2019-08-10 15:40:20  serhiy.storchaka  set     nosy: + serhiy.storchaka, mark.dickinson; messages: + msg349341
2019-08-10 15:36:12  serhiy.storchaka  set     keywords: + patch; stage: patch review; pull_requests: + pull_request14928
2019-08-10 06:24:05  rhettinger        create