Title: Decimal and long hash, compatibly and efficiently
Type: Stage:
Components: None Versions:
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: facundobatista Nosy List: ajaksu2, facundobatista, mark.dickinson
Priority: normal Keywords: patch

Created on 2007-08-13 04:48 by mark.dickinson, last changed 2007-09-19 17:57 by facundobatista. This issue is now closed.

File name Uploaded Description Edit
decimal_hash.patch mark.dickinson, 2007-08-13 04:48 New Decimal.__hash__ and long.__hash__
long_hash.patch mark.dickinson, 2007-09-15 04:06
decimal_hash_v2.patch mark.dickinson, 2007-09-15 04:07
Messages (7)
msg53024 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2007-08-13 04:48
Make hash of int/long periodic with period 2**32-1 (or 2**64-1 on 64-bit systems).  This is makes it possible to define an efficient Decimal hash.

Hash for int/long is already close to periodic with period 2**32-1;  it's a small change to make it genuinely periodic.  With this change, it's possible to write a Decimal __hash__ such that:

(1) hash(Decimal(n))==hash(n) for all integers n, and
(2) Decimal hash can be computed without gross inefficiencies.

The current implementation of Decimal hash is almost unusable for very large Decimals:  hash(Decimal("1E999999999")) first converts its argument to a long, taking gigabytes of memory and a lot of time.
It's simple to satisfy either (1) or (2) above, but it seems impossible to satisfy both without touching the long hash.

The patch alters int_hash and long_hash to make them periodic, adds some tests to Lib/test/, and implements an efficient Decimal.__hash__.  If there's any chance of the patch being accepted I'll write some additional tests in Lib/test/ to check hash compatibility between longs and Decimals.

This patch fixes (most of) bug 1770416.
msg55845 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2007-09-12 17:09
IMHO this patch should be considered for  (at least) py3k, in which long
becomes the new int. As there is current interest in long/int
performance[1] this seems like a good time to raise this kind of issue.

 Mark, could you please outline the semantic changes (specially at
wraparound)? Would the hash value changes happen only in "wraparound (=
subtraction +           of 2**(bits_in_long)) by incrementing" cases?  
Would hash(-1) become -1?

Sorry to ask all that, but my C (and maths) is (are) *very* limited. I
guess an invalid hash value is the reason to have hash(-1) == hash(-2)
== -2 currently, if so the period would have to skip that, right? I
don't see the need for "y" and would chain some IFs (if x == -1 OR
abs(x) == ULONG_MAX), but those are weak guesses.

 Maybe it's worth discussing this patch  without linking its most
important change (long hash becoming periodic) to Decimal: it would have
many impacts beyond Decimal and is rather a core change (as opposed to
Lib).  Also,  a recent discussion on hashing[2] shows that sometimes the
current behavior was well planned :)

1 -
2 -
msg55872 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2007-09-12 20:54
To help explain what's going on, here's some Python code.  The Python
function long_hash1 below has the properties that:

(1) long_hash1(n) == hash(n) for almost all integers n, with a very
small set of exceptions.  (The first positive exception is 2**33-1, and
the next after that is 3*2**33 - 1.  The negative exceptions have the
same absolute value as the positive ones, so the first negative
exception is n = 1-2**33).

(2) long_hash1(n) has a simple closed form, making it possible to
compute an equivalent Decimal hash efficiently.

(3) The current long_hash in Objects/longobject.c can be changed, by
adding a single line of C code, so that hash(n) == long_hash1(n) always.
That line is:

  if ((unsigned long)x < v->ob_digit[i]) x++;

added directly after the addition.

Explanation: the exceptions in (1) arise precisely when the addition

 x += v->ob_digit[i]

in the long_hash code overflows (in the *unsigned* sense---equivalently,
when the addition turns a negative x into a nonnegative one).  Since
ob_digit[i] only has 15 significant bits, and x has 32 (or 64), such
overflow is going to be rare---it'll occur for roughly one addition in
every 2**18 (or about 4 additions in every million), for `random' n.  So
for `most' n, hash(n) and long_hash1(n) are equal.

So what about long_hash2(n)?  This is what the patched long_hash 
is actually equivalent to;  it's essentially the same as long_hash1(n),
but fixed up to be genuinely periodic;  that is, long_hash1(n+ULONG_MAX)
== long_hash1(n) for any integer n.  long_hash1 and long_hash2 are equal
almost always for positive n (the exceptions being multiples of
ULONG_MAX), equal about half the time for negative n, and off-by-one
from each other about half the time.

I don't really know whether the periodicity is that useful---the
predictability is really what matters, so the 1-line change to produce a
hash function equivalent to long_hash1 would do just fine as far as
making Decimal work.

For what it's worth, I regard this patch as an ugly hack.  I don't like
the idea of changing something as fundamental as long.__hash__, and I
don't like having Decimal.__hash__ depend on knowing exactly how
long.__hash__ gets its values.   But there's a real problem here, namely
that, for example,

>>> set([Decimal("1E1000000")])

takes several minutes to complete.  (Try it!)  And I can't think of any
other easy way to solve this.  Alternative suggestions welcomed!

HW = W // 2

def long_hash1(n):
    if n == 0:
        h = 0
        h = (1 + (abs(n)-1) % ULONG_MAX) * (1 if n > 0 else -1)

    # reduce mod W to lie in the usual range for a (signed) C long
    h = (h + HW) % W - HW
    if h == -1:
        h = -2
    return int(h)

def long_hash2(n):
    h = n % ULONG_MAX

    h = (h + HW) % W - HW
    if h == -1:
        h = -2
    return int(h)
msg55875 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2007-09-12 22:28
Thanks a lot for the explanation. I still think it could be valuable
(C?)py3k change  (but now for the predictability, as you say :)

Regarding the performance of Decimal.__hash__:

 I believe (int) hash performance would be a moot issue if
Decimal.__int__ was (much) faster. I've been, for the last  three days,
trying to  convert to use longs instead of tuples for
Decimal._int. Here are some (hand picked) results (on a Celeron M 1.46GHz):

In [906]: D = int_decimal.Decimal
In [914]: g = D("1E1000000")
In [916]: g
Out[916]: Decimal("1E+1000000")
In [918]: %timeit k = int(g)
10 loops, best of 3: 3.22 s per loop
In [920]: %timeit k = hash(int(g))
10 loops, best of 3: 3.28 s per loop
In [922]: log(int(g))
Out[922]: 2302585.0929940455

In [923]: log(10**1000000)
Out[923]: 2302585.0929940455

 I'm not very competent in many skills needed for the job AND I'm
working between field trips (next starts tomorrow), so, unfortunately,
the results are crude and awfully buggy so far. Most "(...)" above were
exceptions for simple things like "hash(g)" and there are lots of these

The conversion is based on using divisions and multiplications to
truncate/extend Decimal._int, from the notion that the bottleneck is the
tuple->str->int conversion. I predict lots of currently fast things
getting slower and a higher memory footprint, but it might still be
worth it and I'm willing to test the idea.

 As I told Facundo in a private email (y otro llegará hoy o mañana), the
simple cases seem workable for me, but I'm not able to tell right now if
this has any real chances of succeeding (because I'm bad at maths :/).
I've not even looked at __pow__, __mul__ and __div__ yet!

Anyway, I'll try to tidy things up as much as I can and email Facundo
(and yourself, if you are interested) the results of this large ugly
hack before now + 24h.
msg55924 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2007-09-15 04:06
Here's a revised patch, that only touches longobject.c and does the minimum 
necessary to alter long.__hash__ from being *almost* completely predictable to 
being completely predictable.  To summarize the effects of this patch:

 - the current long.__hash__ is predictable except for a very small number of 
(unpredictable) inputs;  this patch makes it predictable for all inputs, thereby 
making it possible to define an efficient Decimal.__hash__.  To be precise, the 
patched long.__hash__ has the property that it's periodic of period ULONG_MAX in 
the ranges [1, infinity) and (-infinity, -1].

 - the value of hash(n) is unchanged for any n that's small enough to be 
representable as an int, and also unchanged for the vast majority of long 
integers n of reasonable size.  (For integers up to 450 digits long, the new hash 
value differs from the old for around 1 in every 2500 integers on a 32-bit 
system;  on a 64-bit system the figure is 1 in every 10000 billion.)

 - On my test system (Dual Xeon 2.8GHz/SuSE Linux 10.2/gcc 4.1), using
timeit.timeit('hash(n)') there's no measurable slowdown for small integers.  
Unfortunately there *is* slowdown for larger integers:  around 20% slowdown for 
100 digit integers and around 70% extra time for 1000 digit integers, though in 
all cases the times are in fractions of a microsecond.  It doesn't help that gcc 
doesn't optimize the inner loop perfectly:  with the -O3 flag, the addition and 
subsequent correction are compiled to (with x in eax and v->ob_digit[i] in edx):

        addl    %eax, %edx
        cmpl    %eax, %edx
        adcl    $0, %edx

and the cmpl here is entirely redundant.  Maybe there's some way of writing the C 
code that makes it easier for gcc to pick up on this optimization?
msg55925 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2007-09-15 04:07
And here's the updated decimal.__hash__ patch that goes with the 
long.__hash__ patch.
msg56043 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2007-09-19 17:57
Applied the patchs long_hash.patch (rev 58208) and decimal_hash_v2.patch
(rev 58211)
Date User Action Args
2007-09-19 17:57:19facundobatistasetstatus: open -> closed
resolution: fixed
messages: + msg56043
2007-09-15 04:07:43mark.dickinsonsetfiles: + decimal_hash_v2.patch
messages: + msg55925
2007-09-15 04:06:42mark.dickinsonsetfiles: + long_hash.patch
messages: + msg55924
2007-09-12 22:28:27ajaksu2setmessages: + msg55875
2007-09-12 20:54:42mark.dickinsonsetmessages: + msg55872
2007-09-12 17:09:03ajaksu2setnosy: + ajaksu2
messages: + msg55845
2007-08-13 04:48:04mark.dickinsoncreate