This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: try to use hashtable in pickle
Type: enhancement Stage:
Components: Versions: Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: alexandre.vassalotti, neologix, pitrou, serhiy.storchaka, tim.peters, vstinner
Priority: normal Keywords: patch

Created on 2014-05-22 20:55 by neologix, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description Edit
pickle_use_hashtable.diff neologix, 2014-05-22 20:55 review
Messages (7)
msg218920 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-05-22 20:55
This patch is an attempt at making pickle use Modules/hashtable.{h,c} instead of its hash table ad-hoc implementation for its memoization table.
I'm saying attempt, because although it works correctly, some benchmarks are actually slower.
I didn't profile it, so I don't know if it's due to the hashtable implementation, function call overheads, etc.

If we manage to bring this on par with pickle's ad-hoc implementation, it would probably be interesting to replace it. If not, then we can just drop this patch :-)

Also, there might be other places in the code base which might benefit from this generic hashtable, maybe.
msg218948 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-05-23 08:43
> I'm saying attempt, because although it works correctly, some benchmarks are actually slower.
> I didn't profile it, so I don't know if it's due to the hashtable implementation, function call overheads, etc.

It probably shows that Python dicts (which the pickle hashtable is a rip-off of, module refcounting) are more optimized than the average hash table implementation :-)

But also, _Py_hashtable_hash_ptr is quite crude in that it doesn't even try to compensate for pointer alignment, so there will automatically be many collisions. Only one hash bucket every 8 or 16 will be used, at best.

And the straightforward collision resolution in hashtable.c is much less efficient at mitigating collisions than a Python dict's.
msg218949 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-05-23 08:56
"_Py_hashtable_hash_ptr is quite crude in that it doesn't even try to compensate for pointer alignment, so there will automatically be many collisions. Only one hash bucket every 8 or 16 will be used, at best."

I chose to use _Py_HashPointer() to drop (shift) lower bits. Do you mean that it's not enough? Python memory allocator uses an alignement on 8 bytes (2**3).

Py_uhash_t
_Py_hashtable_hash_ptr(const void *key)
{
    return (Py_uhash_t)_Py_HashPointer((void *)key);
}

Py_hash_t
_Py_HashPointer(void *p)
{
    Py_hash_t x;
    size_t y = (size_t)p;
    /* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
       excessive hash collisions for dicts and sets */
    y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));
    x = (Py_hash_t)y;
    if (x == -1)
        x = -2;
    return x;
}
msg218950 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-05-23 08:57
> "_Py_hashtable_hash_ptr is quite crude in that it doesn't even try to compensate for pointer alignment, so there will automatically be many collisions. Only one hash bucket every 8 or 16 will be used, at best."
> 
> I chose to use _Py_HashPointer() to drop (shift) lower bits. Do you
> mean that it's not enough? Python memory allocator uses an alignement
> on 8 bytes (2**3).

Ah, sorry, I just hadn't realized that.
msg218951 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-05-23 09:01
> And the straightforward collision resolution in hashtable.c is much less efficient at mitigating collisions than a Python dict's.

Modules/hashtable.c comes from http://sourceforge.net/projects/libcfu/ (cfuhash type). I adapted the code for my needs (the tracemalloc module), but I didn't try to make it fast. My main change was to store data directly in an entry of a table instead of using a second memory block for the data (and a pointer in the hash table entry).

I didn't want to use the Python dict type for tracemalloc because this type may use the Python memory allocator which would lead to reentrant calls to tracemalloc. It's also nice to have a function to compute exactly the memory usage of an hash table (table + data), it's exposed in tracemalloc.get_tracemalloc_memory().

It doesn't mean that I want hashtable.c to be slow :-) Feel free to modify it as you want. But trying to make it as fast as Python dict would be hard. The design is different. Python dict uses open addressing, whereas _Py_hashtable uses linked list to store entries. Python dict is well optimized.
msg218952 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-05-23 09:03
> I didn't want to use the Python dict type for tracemalloc because this
> type may use the Python memory allocator which would lead to reentrant
> calls to tracemalloc.

Ah, so this means CF's patch will make the pickle memotable invisible to
tracemalloc?

> It doesn't mean that I want hashtable.c to be slow :-) Feel free to
> modify it as you want. But trying to make it as fast as Python dict
> would be hard. The design is different. Python dict uses open
> addressing, whereas _Py_hashtable uses linked list to store entries.
> Python dict is well optimized.

Yes, optimizing hashtable.c probably means converting it to the same
hashing scheme as dicts (or the current memotable in pickle.c).
msg218953 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-05-23 09:26
"Ah, so this means CF's patch will make the pickle memotable invisible to tracemalloc?"

Currently, _Py_hashtabe uses PyMem_RawMalloc and PyMem_RawFree by default (realloc is not needed, we need to keep the previous buckets on rehash). Using _Py_hashtable_new_full, you can choose a different memory allocator.

Hum, wait. tracemalloc uses trace PyMem_RawMalloc and PyMem_RawFree. In fact, tracemalloc ignores reentrant calls to the memory allocator. So now I don't remember exactly why I chose a custom hash table implementation :-) Maybe because Python dict requires Python objects with reference counter, use the garbage collector, etc.
History
Date User Action Args
2022-04-11 14:58:03adminsetgithub: 65755
2014-05-23 09:26:44vstinnersetmessages: + msg218953
2014-05-23 09:03:31pitrousetmessages: + msg218952
2014-05-23 09:01:10vstinnersetmessages: + msg218951
2014-05-23 08:57:26pitrousetmessages: + msg218950
2014-05-23 08:56:06vstinnersetmessages: + msg218949
2014-05-23 08:43:42pitrousetnosy: + tim.peters
messages: + msg218948
2014-05-22 22:37:50ned.deilysetnosy: + alexandre.vassalotti
2014-05-22 20:55:07neologixcreate