Message 349422 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	mark.dickinson, njs, rhettinger, serhiy.storchaka, tim.peters
Date	2019-08-12.00:40:40
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1565570440.61.0.139863913831.issue37807@roundup.psfhosted.org>
In-reply-to

Content
I agree: we "shouldn't have" documented anything about hash codes beyond the invariants needed to guarantee they work for their intended purpose, chiefly that x == y implies hash(x) == hash(y). Which leads to your other question ;-) That invariant is tricky to maintain across a pile of distinct but comparable numeric types! Whatever we return for an int needs also to be what's returned for a float that happens to have the same value, and whatever we return for a float needs also to be what's returned for a fraction.Fraction with the same value, and so on. So Python uses a single numeric hashing algorithm defined on mathematical rationals, described here: https://docs.python.org/3.8/library/stdtypes.html#hashing-of-numeric-types Presumably that's documented so that people defining their own numeric types have a clue about how to implement compatible hash codes. Anyway, that algorithm uses a large fixed prime P (different on 32- and 64-bit boxes), and uses operations modulo P. That's why the full bit width isn't used. And not just any prime P, it picks one for which computing the remainder mod P is especially efficient. 2**61 - 1 is as good as it gets on 64 bit boxes. I didn't pick this algorithm (I suspect Mark did), but I certainly approve of it. It's clever and general enough to apply to any plausible subset-of-real numeric type short of the constructive reals (where equality isn't even theoretically decidable, so "x == y implies ..." can't get off the ground ;-) ).

I agree:  we "shouldn't have" documented anything about hash codes beyond the invariants needed to guarantee they work for their intended purpose, chiefly that x == y implies hash(x) == hash(y).

Which leads to your other question ;-)  That invariant is tricky to maintain across a pile of distinct but comparable numeric types!  Whatever we return for an int needs also to be what's returned for a float that happens to have the same value, and whatever we return for a float needs also to be what's returned for a fraction.Fraction with the same value, and so on.

So Python uses a single numeric hashing algorithm defined on mathematical rationals, described here:

https://docs.python.org/3.8/library/stdtypes.html#hashing-of-numeric-types

Presumably that's documented so that people defining their own numeric types have a clue about how to implement compatible hash codes.

Anyway, that algorithm uses a large fixed prime P (different on 32- and 64-bit boxes), and uses operations modulo P.  That's why the full bit width isn't used.  And not just any prime P, it picks one for which computing the remainder mod P is especially efficient.  2**61 - 1 is as good as it gets on 64 bit boxes.

I didn't pick this algorithm (I suspect Mark did), but I certainly approve of it.  It's clever and general enough to apply to any plausible subset-of-real numeric type short of the constructive reals (where equality isn't even theoretically decidable, so "x == y implies ..." can't get off the ground ;-) ).

History
Date	User	Action	Args
2019-08-12 00:40:40	tim.peters	set	recipients: + tim.peters, rhettinger, mark.dickinson, njs, serhiy.storchaka
2019-08-12 00:40:40	tim.peters	set	messageid: <1565570440.61.0.139863913831.issue37807@roundup.psfhosted.org>
2019-08-12 00:40:40	tim.peters	link	issue37807 messages
2019-08-12 00:40:40	tim.peters	create