classification
Title: tuples should remember their hash value
Type: resource usage Stage: needs patch
Components: Versions: Python 3.2
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, dtorp, rhettinger, tim_one
Priority: low Keywords: easy

Created on 2010-08-25 19:33 by dtorp, last changed 2010-10-31 16:28 by rhettinger.

Messages (6)
msg114929 - (view) Author: David Albert Torpey (dtorp) Date: 2010-08-25 19:33
Dictionary keys are commonly numbers, strings, or tuples. Python has optimized numbers and strings to remember their hash values on successive calls. Tuples should do this too since their recursive hash function can take a long time to compute. Tuples are Python's official record type and the one obvious way of making non-scalar dictionary keys.

The code to do this in stringobject.c is short and sweet, so this major speed boost should be an easy thing to. 

static long
string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;

    if (a->ob_shash != -1)               <==
        return a->ob_shash;              <==
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)                         <==
        x = -2;                          <==
    a->ob_shash = x;
    return x;
}

The code in tupleobject.c would just need to add the four lines marked above. Here's what is looks like now.

static long
tuplehash(PyTupleObject *v)
{
    register long x, y;
    register Py_ssize_t len = Py_SIZE(v);
    register PyObject **p;
    long mult = 1000003L;
    x = 0x345678L;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash stability */
        mult += (long)(82520L + len + len);
    }
    x += 97531L;
    if (x == -1)
        x = -2;
    return x;
}

Thank you guys for all of your work.

*David
msg114930 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-08-25 19:36
This seems reasonable.  Will look at it in the next few days.
msg114932 - (view) Author: Tim Peters (tim_one) * (Python committer) Date: 2010-08-25 20:00
- Tuple objects don't currently reserve space to store their hash code, so it's likely this would increase the size of every tuple.

- It's unclear to me which natural use patterns would actually enjoy a major speed boost.  Note that dicts remember the hash codes of keys already, regardless of whether the key type remembers them too.  A tuple is typically constructed right before being used in a dict lookup, so for a one-shot use no time would be saved.  If the new tuple is used in multiple dict lookups, sure - but is that common?
msg114936 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-08-25 21:37
FWIW, I'm -1 on this without a demonstrable improvement on some real-world cases.
msg114937 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-08-25 22:04
Hello Tim! If you have a chance, please also take a look at issue9685 which I was planning to work on in the next couple of weeks.

For memoizing tuple hashes, I'm inclined to think the one extra field is worth it.  That would help all the cases where people are concerned about double accesses to dicts in a look-before-you-leap pattern or for a pattern of fetch-item-update-value-store-new-item.

It looks like the code for collections.OrderedDict() would benefit because it does multiple lookups and stores on the same key: http://svn.python.org/view/python/branches/release27-maint/Lib/collections.py?revision=84148&view=markup

It would also help the multiple lookups and stores in caching code such as that at http://code.activestate.com/recipes/498245-lru-and-lfu-cache-decorators

I suppose we could prepare a patch, instrument it, and try it with Twisted, SQLalchemy, and Django to find-out how many tuple hash calculations would be saved by memoizing.
msg114939 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-08-25 22:14
2010/8/25 Raymond Hettinger <report@bugs.python.org>:
> I suppose we could prepare a patch, instrument it, and try it with Twisted, SQLalchemy, and Django to find-out how many tuple hash calculations would be saved by memoizing.

You should also check memory usage.
History
Date User Action Args
2010-10-31 16:28:58rhettingersetassignee: rhettinger ->
2010-08-25 22:14:50benjamin.petersonsetmessages: + msg114939
2010-08-25 22:04:14rhettingersetmessages: + msg114937
2010-08-25 21:37:33benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg114936
2010-08-25 20:00:02tim_onesetnosy: + tim_one
messages: + msg114932
2010-08-25 19:36:11rhettingerset
nosy: + rhettinger
versions: + Python 3.2, - Python 2.6
messages: + msg114930
priority: normal -> low
assignee: rhettinger
keywords: + easy
stage: needs patch
2010-08-25 19:33:15dtorpcreate