
dict viewkeys intersection slow for large dicts #71762

Closed
DavidSu2 mannequin opened this issue Jul 19, 2016 · 14 comments
Assignees: rhettinger
Labels: 3.9 (only security fixes), performance (Performance or resource usage), stdlib (Python modules in the Lib dir)

Comments

@DavidSu2 commented Jul 19, 2016

BPO 27575
Nosy: @rhettinger, @serhiy-storchaka, @fgregg
PRs:
  • bpo-27575: port set intersection logic into dictview intersection #7696
Files:
  • dict_view_intersection.patch
  • dictview_intersection_test.py
  • performance_test.sh

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2019-09-18.07:11:26.419>
    created_at = <Date 2016-07-19.17:51:47.621>
    labels = ['library', '3.9', 'performance']
    title = 'dict viewkeys intersection slow for large dicts'
    updated_at = <Date 2019-09-18.07:11:26.418>
    user = 'https://bugs.python.org/DavidSu2'

    bugs.python.org fields:

    activity = <Date 2019-09-18.07:11:26.418>
    actor = 'serhiy.storchaka'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2019-09-18.07:11:26.419>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2016-07-19.17:51:47.621>
    creator = 'David Su2'
    dependencies = []
    files = ['43959', '43960', '43961']
    hgrepos = []
    issue_num = 27575
    keywords = ['patch']
    message_count = 14.0
    messages = ['270836', '270905', '270906', '271035', '271095', '271770', '274563', '319469', '319488', '319575', '350500', '350501', '352704', '352706']
    nosy_count = 5.0
    nosy_names = ['rhettinger', 'serhiy.storchaka', 'Mali Akmanalp', 'David Su2', 'fgregg']
    pr_nums = ['7696']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue27575'
    versions = ['Python 3.9']


    In dictobject.c, the function dictviews_and performs a very expensive set creation operation on the keys of the self dict object:

    From Python 2.7:
    ...
static PyObject*
dictviews_and(PyObject* self, PyObject *other)
{
    /* Copies every key of self into a brand-new set: O(len(self)) time and space. */
    PyObject *result = PySet_New(self);
...
    /* The intersection then runs against that full copy of the keys. */
    tmp = PyObject_CallMethod(result, "intersection_update", "(O)", other);
...

    Similar logic can be found in the Python 3 code as well.

    For large dicts (~10 000 000 keys), it takes a significant amount of time to copy the keys into a new set, and also wastes a lot of memory. This issue is amplified when the intersection operation is performed many times.

This implementation uses O(len(self)) time and space, while it is expected to use O(min(len(self), len(other))) time and space.

Solution 1: Manually copy the keys of the dict into a set, and perform all intersection operations on that set. However, this still wastes a lot of memory.

    Solution 2 (Recommended): Port the intersection logic from the set class to the dict view class.

@DavidSu2 added the stdlib (Python modules in the Lib dir) and performance (Performance or resource usage) labels on Jul 19, 2016
@rhettinger commented

FWIW, sets and dicts convert back and forth very efficiently (the hash values are reused and the collection is presized with just a single memory allocation). Also, sets and dicts share pointers to the underlying data (which is often much bigger than the containers that refer to it).

    Also, the use case of having very large dicts with repeated intersection operations isn't common enough to warrant any significant code expansion (your solution two).

    I don't understand your "solution one". PySet_New(self) does copy references to the keys from the dict to a set and the intersection_update() does perform operations on that set.

Lastly, since there was no clear example given, there is no opportunity to see whether the code would have been better designed to simply use sets for the set operations and keep the dict around for the key-to-value transformations (in other words, the current toolset may already afford some way to accomplish the goal).
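
A minimal sketch of that last pattern (illustrative names and toy data of my own, not from this issue): pay the set conversion once, do the set algebra on the set, and keep the dict for the key-to-value step.

index = {'spam': 1, 'ham': 2, 'eggs': 3}    # stands in for a large key -> value dict
index_keys = set(index)                     # conversion cost paid once, reused afterwards

def lookup(query_keys):
    hits = index_keys & query_keys               # ordinary set-set intersection
    return {key: index[key] for key in hits}     # the dict supplies key -> value

print(lookup({'spam', 'bacon'}))                 # -> {'spam': 1}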

@DavidSu2 commented Jul 21, 2016

    Here are some tests that I ran:

Using the current naive viewkeys intersection:

    $ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0)" "d.viewkeys() & {0}"
    100 loops, best of 3: 447 msec per loop

    Nearly half a second per iteration is unacceptably slow.

    Solution 1 from my previous comment:

    $ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0); s = set(d)" "s & {0}"
    100 loops, best of 3: 0.391 usec per loop

    This solution caches the keys of the dict in a set, then performs all intersections on that set.
    The initial creation of the set is slow, and it wastes a lot of memory, but each iteration is much faster.

    The workaround that I ended up using:

    $ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(10000000), 0)" "{item for item in {0} if item in d}"
    100 loops, best of 3: 0.629 usec per loop

This solution uses a set comprehension to look up each item of the small set and keep it if it is also a key of the dict.
This avoids wasting time and space on building a set of all the dict's keys. For my use case, memory was the bigger issue: my dictionary can approach tens of GB in size, and nearly doubling the memory consumption by creating a cache set was unacceptable.

Of course, the set comprehension should iterate over the smaller of the dict/set to achieve the O(min(len(d), len(s))) bound given in https://wiki.python.org/moin/TimeComplexity.
An implementation would look something like this:

    def efficient_intersection(smaller_dict_or_set, bigger_dict_or_set):
        if len(bigger_dict_or_set) < len(smaller_dict_or_set):
            bigger_dict_or_set, smaller_dict_or_set = smaller_dict_or_set, bigger_dict_or_set
        
        return {item for item in smaller_dict_or_set if item in bigger_dict_or_set}
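
A quick usage sketch (toy values of my own; under Python 2, build the large dict with xrange):

d = dict.fromkeys(range(10000000), 0)        # large dict, keys 0..9999999
queries = {0, 42, -1}
print(efficient_intersection(queries, d))    # probes 3 keys instead of scanning d -> {0, 42}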

In conclusion, porting the set intersection logic over to viewkeys would be the ideal solution. It avoids the memory waste of the set-cache solution, and I am sure a C implementation would be much faster than my workaround of manually performing the intersection with set comprehensions.

    My view is that dict.viewkeys() & set(...) should be as performant as the syntax suggests.

@rhettinger commented

Did you encounter an actual real-world use case, or is this just a theoretical issue? ISTM the performance is already good enough for most practical applications.

@DavidSu2 commented Jul 23, 2016

    I am trying to implement the following algorithm:

    http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/

    This algorithm is used for fast nearest neighbor queries for spell correction.

Given a corpus of strings, for each string s I generate all subsequences (note: subsequences, not substrings) of s of length len(s) - k, where k is a tunable parameter; call each such subsequence t. Then, I create a dict mapping the subsequences t to the original string s.

    This process creates a very large dict from a relatively small corpus. For example, my corpus of ~500k strings blew up to a dict of ~10M keys, with k=3.

Then, for a query string q, you generate all subsequences of q of length len(q) - k and intersect them with the keys of the dict. The values of the dict corresponding to the intersected keys are the possible misspellings of the query string q.

    In pseudo-python:

from collections import defaultdict
from itertools import combinations
from typing import List

def get_subseq(s: str, k: int):
    # All subsequences of s of length len(s) - k, i.e. every way to delete k characters.
    return {''.join(chars) for chars in combinations(s, len(s) - k)}

def build_dict(corpus: List[str], k: int):
    d = defaultdict(list)

    for orig_str in corpus:
        for subseq in get_subseq(orig_str, k):
            d[subseq].append(orig_str)

    return d

def query(d: dict, q: str, k: int):
    q_subseq = set(get_subseq(q, k))
    intersection = d.viewkeys() & q_subseq  # slow when len(d) is large; Python 2 spelling (d.keys() in Python 3)
    return [orig_str for key in intersection for orig_str in d[key]]
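
For comparison, here is query rewritten with the membership-test workaround from my earlier comment (a sketch reusing the helpers above):

def query_workaround(d: dict, q: str, k: int):
    # Probe each query subsequence against the dict directly:
    # O(len(q_subseq)) lookups instead of copying all of d's keys into a set.
    q_subseq = get_subseq(q, k)
    matching = {subseq for subseq in q_subseq if subseq in d}
    return [orig_str for key in matching for orig_str in d[key]]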
        

By exposing an intersection interface, the language leads the user to expect a certain degree of performance. It took me a considerable amount of debugging to realize that the dict.viewkeys() intersection was the performance bottleneck, and I only found the root cause after digging into the CPython source code. While I am familiar with C and found the issue relatively quickly, users who aren't should not have to read the interpreter's source to understand the performance of a mainstream programming language like Python.

    I am currently looking at how to potentially fix this in the CPython repository. I may submit a patch in the near future if I find a good solution.

@DavidSu2 commented Aug 1, 2016

    Attached is a patch that I had been working on. Could you please review and provide me with some feedback? Thanks!

@DavidSu2 commented Sep 6, 2016

    ping

@rhettinger self-assigned this on Sep 6, 2016
@fgregg commented Jun 13, 2018

    Is there anything helpful I can do to help close this issue? Maybe convert it into a github PR?

    Since Raymond asked for cases where this issue is a problem, I'll add the case that brought me here.

I have an application where I need to do thousands of intersections of multisets. I started with the collections.Counter object, but its current intersection method is too slow. As Counter is a subclass of dict, I thought I could significantly speed it up by taking advantage of key intersections.

I've verified that if key intersection were roughly as fast as set intersection, it would be very helpful. However, the current speed of key intersection does not make that practicable. I can, of course, cast the keys to sets before intersecting, but as David points out, that cast is what takes significant time.

In short, slow key intersection on large dicts is becoming a problem for me.
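
For illustration, a minimal sketch (mine, not from the PR) of the Counter intersection I have in mind: Counter & Counter keeps, for each common key, the smaller of the two counts, so a fast key-view intersection would carry most of the cost (Python 3 key views assumed):

from collections import Counter

def fast_and(c1, c2):
    # Multiset intersection: for each key present in both, keep the minimum count.
    # c1.keys() & c2.keys() is exactly the dict-view intersection this issue is about.
    common = c1.keys() & c2.keys()
    return Counter({key: min(c1[key], c2[key]) for key in common})

print(fast_and(Counter('abracadabra'), Counter('alakazam')))  # -> Counter({'a': 4})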

@rhettinger commented

> Is there anything helpful I can do to help close this issue? Maybe convert it into a github PR?

    Yes, that would be nice. Also, please verify that the test cases cover all the code paths.

@fgregg commented Jun 15, 2018

    Hi Raymond,

    I've created a PR here: #7696, and
    I've verified that there are tests for all the code paths.

    I reached out to Daniel Hsu and he has given his blessing to help push this forward.

    I think this is ready for your review. Please let me know if there's anything else I can do.

    Thanks,

    Forest

@rhettinger commented

    New changeset 998cf1f by Raymond Hettinger (Forest Gregg) in branch 'master':
    bpo-27575: port set intersection logic into dictview intersection (GH-7696)

@rhettinger commented

    Thanks David and Forest.

@serhiy-storchaka commented

I am puzzled: why do we still call set.intersection_update with an empty set?

@serhiy-storchaka commented

    This changed behavior. See bpo-38210.

See also bpo-38202 for another regression.

@serhiy-storchaka added the 3.9 (only security fixes) label on Sep 18, 2019
@ezio-melotti transferred this issue from another repository on Apr 10, 2022