Message 55872 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	mark.dickinson
Recipients	ajaksu2, facundobatista, mark.dickinson
Date	2007-09-12.20:54:41
SpamBayes Score	0.0008566817
Marked as misclassified	No
Message-id	<1189630482.56.0.708229211175.issue1772851@psf.upfronthosting.co.za>
In-reply-to

Content
To help explain what's going on, here's some Python code. The Python function long_hash1 below has the properties that: (1) long_hash1(n) == hash(n) for almost all integers n, with a very small set of exceptions. (The first positive exception is 2*33-1, and the next after that is 3233 - 1. The negative exceptions have the same absolute value as the positive ones, so the first negative exception is n = 1-233). (2) long_hash1(n) has a simple closed form, making it possible to compute an equivalent Decimal hash efficiently. (3) The current long_hash in Objects/longobject.c can be changed, by adding a single line of C code, so that hash(n) == long_hash1(n) always. That line is: if ((unsigned long)x < v->ob_digit[i]) x++; added directly after the addition. Explanation: the exceptions in (1) arise precisely when the addition x += v->ob_digit[i] in the long_hash code overflows (in the unsigned sense---equivalently, when the addition turns a negative x into a nonnegative one). Since ob_digit[i] only has 15 significant bits, and x has 32 (or 64), such overflow is going to be rare---it'll occur for roughly one addition in every 218 (or about 4 additions in every million), for `random' n. So for `most' n, hash(n) and long_hash1(n) are equal. So what about long_hash2(n)? This is what the patched long_hash is actually equivalent to; it's essentially the same as long_hash1(n), but fixed up to be genuinely periodic; that is, long_hash1(n+ULONG_MAX) == long_hash1(n) for any integer n. long_hash1 and long_hash2 are equal almost always for positive n (the exceptions being multiples of ULONG_MAX), equal about half the time for negative n, and off-by-one from each other about half the time. I don't really know whether the periodicity is that useful---the predictability is really what matters, so the 1-line change to produce a hash function equivalent to long_hash1 would do just fine as far as making Decimal work. For what it's worth, I regard this patch as an ugly hack. I don't like the idea of changing something as fundamental as long.__hash__, and I don't like having Decimal.__hash__ depend on knowing exactly how long.__hash__ gets its values. But there's a real problem here, namely that, for example, >>> set([Decimal("1E1000000")]) takes several minutes to complete. (Try it!) And I can't think of any other easy way to solve this. Alternative suggestions welcomed! LONG_BITS = 32 W = 2LONG_BITS HW = W // 2 ULONG_MAX = W - 1 def long_hash1(n): if n == 0: h = 0 else: h = (1 + (abs(n)-1) % ULONG_MAX) * (1 if n > 0 else -1) # reduce mod W to lie in the usual range for a (signed) C long h = (h + HW) % W - HW if h == -1: h = -2 return int(h) def long_hash2(n): h = n % ULONG_MAX h = (h + HW) % W - HW if h == -1: h = -2 return int(h)

To help explain what's going on, here's some Python code.  The Python
function long_hash1 below has the properties that:

(1) long_hash1(n) == hash(n) for almost all integers n, with a very
small set of exceptions.  (The first positive exception is 2**33-1, and
the next after that is 3*2**33 - 1.  The negative exceptions have the
same absolute value as the positive ones, so the first negative
exception is n = 1-2**33).

(2) long_hash1(n) has a simple closed form, making it possible to
compute an equivalent Decimal hash efficiently.

(3) The current long_hash in Objects/longobject.c can be changed, by
adding a single line of C code, so that hash(n) == long_hash1(n) always.
That line is:

  if ((unsigned long)x < v->ob_digit[i]) x++;

added directly after the addition.

Explanation: the exceptions in (1) arise precisely when the addition

 x += v->ob_digit[i]

in the long_hash code overflows (in the *unsigned* sense---equivalently,
when the addition turns a negative x into a nonnegative one).  Since
ob_digit[i] only has 15 significant bits, and x has 32 (or 64), such
overflow is going to be rare---it'll occur for roughly one addition in
every 2**18 (or about 4 additions in every million), for `random' n.  So
for `most' n, hash(n) and long_hash1(n) are equal.

So what about long_hash2(n)?  This is what the patched long_hash 
is actually equivalent to;  it's essentially the same as long_hash1(n),
but fixed up to be genuinely periodic;  that is, long_hash1(n+ULONG_MAX)
== long_hash1(n) for any integer n.  long_hash1 and long_hash2 are equal
almost always for positive n (the exceptions being multiples of
ULONG_MAX), equal about half the time for negative n, and off-by-one
from each other about half the time.

I don't really know whether the periodicity is that useful---the
predictability is really what matters, so the 1-line change to produce a
hash function equivalent to long_hash1 would do just fine as far as
making Decimal work.

For what it's worth, I regard this patch as an ugly hack.  I don't like
the idea of changing something as fundamental as long.__hash__, and I
don't like having Decimal.__hash__ depend on knowing exactly how
long.__hash__ gets its values.   But there's a real problem here, namely
that, for example,

>>> set([Decimal("1E1000000")])

takes several minutes to complete.  (Try it!)  And I can't think of any
other easy way to solve this.  Alternative suggestions welcomed!



LONG_BITS = 32
W = 2**LONG_BITS
HW = W // 2
ULONG_MAX = W - 1

def long_hash1(n):
    if n == 0:
        h = 0
    else:
        h = (1 + (abs(n)-1) % ULONG_MAX) * (1 if n > 0 else -1)

    # reduce mod W to lie in the usual range for a (signed) C long
    h = (h + HW) % W - HW
    if h == -1:
        h = -2
    return int(h)

def long_hash2(n):
    h = n % ULONG_MAX

    h = (h + HW) % W - HW
    if h == -1:
        h = -2
    return int(h)

History
Date	User	Action	Args
2007-09-12 20:54:42	mark.dickinson	set	spambayes_score: 0.000856682 -> 0.0008566817 recipients: + mark.dickinson, facundobatista, ajaksu2
2007-09-12 20:54:42	mark.dickinson	set	spambayes_score: 0.000856682 -> 0.000856682 messageid: <1189630482.56.0.708229211175.issue1772851@psf.upfronthosting.co.za>
2007-09-12 20:54:42	mark.dickinson	link	issue1772851 messages
2007-09-12 20:54:41	mark.dickinson	create