
Decimal and long hash, compatibly and efficiently #45306

Closed
mdickinson opened this issue Aug 13, 2007 · 7 comments
@mdickinson
Member

BPO 1772851
Nosy @facundobatista, @mdickinson, @devdanzin
Files
  • decimal_hash.patch: New Decimal.__hash__ and long.__hash__
  • long_hash.patch
  • decimal_hash_v2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/facundobatista'
    closed_at = <Date 2007-09-19.17:57:19.166>
    created_at = <Date 2007-08-13.04:48:04.000>
    labels = []
    title = 'Decimal and long hash, compatibly and efficiently'
    updated_at = <Date 2007-09-19.17:57:19.164>
    user = 'https://github.com/mdickinson'

    bugs.python.org fields:

    activity = <Date 2007-09-19.17:57:19.164>
    actor = 'facundobatista'
    assignee = 'facundobatista'
    closed = True
    closed_date = <Date 2007-09-19.17:57:19.166>
    closer = 'facundobatista'
    components = ['None']
    creation = <Date 2007-08-13.04:48:04.000>
    creator = 'mark.dickinson'
    dependencies = []
    files = ['8166', '8431', '8432']
    hgrepos = []
    issue_num = 1772851
    keywords = ['patch']
    message_count = 7.0
    messages = ['53024', '55845', '55872', '55875', '55924', '55925', '56043']
    nosy_count = 3.0
    nosy_names = ['facundobatista', 'mark.dickinson', 'ajaksu2']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1772851'
    versions = []

    @mdickinson
    Member Author

Make hash of int/long periodic with period 2**32-1 (or 2**64-1 on 64-bit systems). This makes it possible to define an efficient Decimal hash.

    Hash for int/long is already close to periodic with period 2**32-1; it's a small change to make it genuinely periodic. With this change, it's possible to write a Decimal __hash__ such that:

    (1) hash(Decimal(n))==hash(n) for all integers n, and
    (2) Decimal hash can be computed without gross inefficiencies.

    The current implementation of Decimal hash is almost unusable for very large Decimals: hash(Decimal("1E999999999")) first converts its argument to a long, taking gigabytes of memory and a lot of time.
    It's simple to satisfy either (1) or (2) above, but it seems impossible to satisfy both without touching the long hash.

    The patch alters int_hash and long_hash to make them periodic, adds some tests to Lib/test/test_hash.py, and implements an efficient Decimal.__hash__. If there's any chance of the patch being accepted I'll write some additional tests in Lib/test/test_decimal.py to check hash compatibility between longs and Decimals.

    This patch fixes (most of) bug 1770416.

    @devdanzin
    Mannequin

    devdanzin mannequin commented Sep 12, 2007

    IMHO this patch should be considered for (at least) py3k, in which long
    becomes the new int. As there is current interest in long/int
    performance[1] this seems like a good time to raise this kind of issue.

Mark, could you please outline the semantic changes (especially at
wraparound)? Would the hash value changes happen only in the "wraparound (=
subtraction of 2**(bits_in_long)) by incrementing" cases?
Would hash(-1) become -1?

Sorry to ask all that, but my C (and maths) is (are) *very* limited. I
guess an invalid hash value is the reason to have hash(-1) == hash(-2)
== -2 currently; if so, the period would have to skip that, right? I
don't see the need for "y" and would chain some IFs (if x == -1 OR
abs(x) == ULONG_MAX), but those are weak guesses.

    Maybe it's worth discussing this patch without linking its most
    important change (long hash becoming periodic) to Decimal: it would have
    many impacts beyond Decimal and is rather a core change (as opposed to
    Lib). Also, a recent discussion on hashing[2] shows that sometimes the
    current behavior was well planned :)

    1 - http://mail.python.org/pipermail/python-3000/2007-September/010374.html
    2 - http://mail.python.org/pipermail/python-3000/2007-September/010327.html

    @mdickinson
    Member Author

    To help explain what's going on, here's some Python code. The Python
    function long_hash1 below has the properties that:

    (1) long_hash1(n) == hash(n) for almost all integers n, with a very
    small set of exceptions. (The first positive exception is 2**33-1, and
    the next after that is 3*2**33 - 1. The negative exceptions have the
    same absolute value as the positive ones, so the first negative
    exception is n = 1-2**33).

    (2) long_hash1(n) has a simple closed form, making it possible to
    compute an equivalent Decimal hash efficiently.

    (3) The current long_hash in Objects/longobject.c can be changed, by
    adding a single line of C code, so that hash(n) == long_hash1(n) always.
    That line is:

    if ((unsigned long)x < v->ob_digit[i]) x++;

    added directly after the addition.

    Explanation: the exceptions in (1) arise precisely when the addition

    x += v->ob_digit[i]

    in the long_hash code overflows (in the unsigned sense---equivalently,
    when the addition turns a negative x into a nonnegative one). Since
    ob_digit[i] only has 15 significant bits, and x has 32 (or 64), such
    overflow is going to be rare---it'll occur for roughly one addition in
every 2**18 (or about 4 additions in every million), for `random' n.
So for `most' n, hash(n) and long_hash1(n) are equal.

    So what about long_hash2(n)? This is what the patched long_hash
    is actually equivalent to; it's essentially the same as long_hash1(n),
but fixed up to be genuinely periodic; that is, long_hash2(n+ULONG_MAX)
== long_hash2(n) for any integer n. long_hash1 and long_hash2 are equal
    almost always for positive n (the exceptions being multiples of
    ULONG_MAX), equal about half the time for negative n, and off-by-one
    from each other about half the time.

    I don't really know whether the periodicity is that useful---the
    predictability is really what matters, so the 1-line change to produce a
    hash function equivalent to long_hash1 would do just fine as far as
    making Decimal work.

    For what it's worth, I regard this patch as an ugly hack. I don't like
    the idea of changing something as fundamental as long.__hash__, and I
    don't like having Decimal.__hash__ depend on knowing exactly how
    long.__hash__ gets its values. But there's a real problem here, namely
    that, for example,

>>> set([Decimal("1E1000000")])

    takes several minutes to complete. (Try it!) And I can't think of any
    other easy way to solve this. Alternative suggestions welcomed!

    LONG_BITS = 32
    W = 2**LONG_BITS
    HW = W // 2
    ULONG_MAX = W - 1
    
    def long_hash1(n):
        if n == 0:
            h = 0
        else:
            h = (1 + (abs(n)-1) % ULONG_MAX) * (1 if n > 0 else -1)
    
        # reduce mod W to lie in the usual range for a (signed) C long
        h = (h + HW) % W - HW
        if h == -1:
            h = -2
        return int(h)
    
    def long_hash2(n):
        h = n % ULONG_MAX
    
        h = (h + HW) % W - HW
        if h == -1:
            h = -2
        return int(h)

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Sep 12, 2007

Thanks a lot for the explanation. I still think it could be a valuable
(C?)py3k change (but now for the predictability, as you say :)

    Regarding the performance of Decimal.__hash__:

    I believe (int) hash performance would be a moot issue if
    Decimal.__int__ was (much) faster. I've been, for the last three days,
    trying to convert decimal.py to use longs instead of tuples for
    Decimal._int. Here are some (hand picked) results (on a Celeron M 1.46GHz):

    In [906]: D = int_decimal.Decimal
    (...)
    In [914]: g = D("1E1000000")
    (...)
    In [916]: g
    Out[916]: Decimal("1E+1000000")
    (...)
    In [918]: %timeit k = int(g)
    10 loops, best of 3: 3.22 s per loop
    (...)
    In [920]: %timeit k = hash(int(g))
    10 loops, best of 3: 3.28 s per loop
    (...)
    In [922]: log(int(g))
    Out[922]: 2302585.0929940455

    In [923]: log(10**1000000)
    Out[923]: 2302585.0929940455

    I'm not very competent in many skills needed for the job AND I'm
    working between field trips (next starts tomorrow), so, unfortunately,
    the results are crude and awfully buggy so far. Most "(...)" above were
    exceptions for simple things like "hash(g)" and there are lots of these
    issues.

    The conversion is based on using divisions and multiplications to
    truncate/extend Decimal._int, from the notion that the bottleneck is the
    tuple->str->int conversion. I predict lots of currently fast things
    getting slower and a higher memory footprint, but it might still be
    worth it and I'm willing to test the idea.

As I told Facundo in a private email (and another will arrive today or tomorrow), the
    simple cases seem workable for me, but I'm not able to tell right now if
    this has any real chances of succeeding (because I'm bad at maths :/).
    I've not even looked at __pow__, __mul__ and __div__ yet!

    Anyway, I'll try to tidy things up as much as I can and email Facundo
    (and yourself, if you are interested) the results of this large ugly
    hack before now + 24h.

    @mdickinson
    Member Author

    Here's a revised patch, that only touches longobject.c and does the minimum
    necessary to alter long.__hash__ from being *almost* completely predictable to
    being completely predictable. To summarize the effects of this patch:

    • the current long.__hash__ is predictable except for a very small number of
      (unpredictable) inputs; this patch makes it predictable for all inputs, thereby
      making it possible to define an efficient Decimal.__hash__. To be precise, the
      patched long.__hash__ has the property that it's periodic of period ULONG_MAX in
      the ranges [1, infinity) and (-infinity, -1].

• the value of hash(n) is unchanged for any n that's small enough to be
  representable as an int, and also unchanged for the vast majority of long
  integers n of reasonable size. (For integers up to 450 digits long, the new hash
  value differs from the old for around 1 in every 2500 integers on a 32-bit
  system; on a 64-bit system the figure is 1 in every 10000 billion; a
  back-of-envelope check of these figures follows after this comment.)

    • On my test system (Dual Xeon 2.8GHz/SuSE Linux 10.2/gcc 4.1), using
      timeit.timeit('hash(n)') there's no measurable slowdown for small integers.
      Unfortunately there *is* slowdown for larger integers: around 20% slowdown for
      100 digit integers and around 70% extra time for 1000 digit integers, though in
      all cases the times are in fractions of a microsecond. It doesn't help that gcc
      doesn't optimize the inner loop perfectly: with the -O3 flag, the addition and
      subsequent correction are compiled to (with x in eax and v->ob_digit[i] in edx):

       addl    %eax, %edx
       cmpl    %eax, %edx
       adcl    $0, %edx
      

    and the cmpl here is entirely redundant. Maybe there's some way of writing the C
    code that makes it easier for gcc to pick up on this optimization?

    @mdickinson
    Member Author

    And here's the updated decimal.__hash__ patch that goes with the
    long.__hash__ patch.

    @facundobatista
    Member

Applied the patches long_hash.patch (rev 58208) and decimal_hash_v2.patch
(rev 58211).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022