classification
Title: Symmetric difference on dict_views is inefficient
Type: performance                    Stage: resolved
Components: Interpreter Core         Versions: Python 3.10
process
Status: closed                       Resolution: fixed
Dependencies:                        Superseder:
Assigned To:                         Nosy List: Dennis Sweeney, rhettinger, serhiy.storchaka
Priority: normal                     Keywords: patch

Created on 2020-06-06 17:09 by rhettinger, last changed 2020-06-10 06:50 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 20718 merged Dennis Sweeney, 2020-06-08 12:05
Messages (11)
msg370839 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-06-06 17:09
Running "d1.items() ^ d2.items()" will rehash every key and value in both dictionaries regardless of how much they overlap.

By taking advantage of the known hashes, the analysis step could avoid making any calls to __hash__().  Only the result tuples would need to be hashed.

Currently the code below calls hash for every key and value on the left and for every key and value on the right:

  >>> left = {1: -1, 2: -2, 3: -3, 4: -4, 5: -5, 6: -6, 7: -7}
  >>> right = {1: -1, 2: -2, 3: -3, 4: -4, 5: -5, 8: -8, 9: -9}
  >>> left.items() ^ right.items()        # Total work: 28 __hash__() calls
  {(6, -6), (7, -7), (8, -8), (9, -9)}

Compare that with the workload for set symmetric difference, which makes zero calls to __hash__():

  >>> set(left) ^ set(right)
  {6, 7, 8, 9}

FWIW, I do have an important use case where this matters.
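
For illustration only, here is a rough pure-Python sketch of that idea (the helper name items_symmetric_difference is invented here; this is not the code of the eventual patch). In C the key lookups can reuse the hashes already stored in the two dict tables, which pure Python cannot express, so the sketch only shows the overall shape: matching entries are discarded by a key lookup plus a value comparison, and only the pairs that end up in the result set are hashed as tuples.

    def items_symmetric_difference(d1, d2):
        # Rough sketch of d1.items() ^ d2.items(): look each key up in the
        # other dict and compare values; only pairs that actually land in
        # the result set are hashed as (key, value) tuples.  (In the C
        # implementation the key lookups can reuse the hashes stored in
        # the dict entries.)
        result = set()
        for key, value in d1.items():
            if key not in d2 or d2[key] != value:
                result.add((key, value))
        for key, value in d2.items():
            if key not in d1 or d1[key] != value:
                result.add((key, value))
        return result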
msg370933 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-06-07 23:41
But the hashes of the items are not known. The hashes of the keys are known; the hashes of the values and of the (key, value) tuples are not. We could add a special case for dict views in the set constructor and inline the hashing code for tuples, but that would be a lot of code for a pretty rare case. And since the value hashes have to be calculated in any case, and significant time will be spent on allocating the 2-tuples, the benefit of such an optimization would be minor.
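
To make that point concrete (the Noisy class below is invented purely for illustration): hashing a result 2-tuple necessarily calls __hash__ on the value, so the value hashes cannot be skipped even if the key hashes are reused.

    class Noisy:
        calls = 0
        def __init__(self, x):
            self.x = x
        def __hash__(self):
            Noisy.calls += 1
            return hash(self.x)

    hash((1, Noisy(-1)))   # hashing the 2-tuple...
    print(Noisy.calls)     # ...calls Noisy.__hash__ once -> prints 1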
msg370954 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-06-08 04:19
It really depends on whether the key hashes are cheap or not.
msg370956 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python triager) Date: 2020-06-08 04:36
What about returning another dict_items instead of a set? As in (using the convention `d.items().mapping is d`):


    dict_items = type({}.items())

    def __xor__(self: dict_items, other):
        if isinstance(other, dict_items):
            new = self.mapping.copy()
            MISSING = object()
            for key, value in other:
                existing = new.get(key, MISSING)
                if existing is MISSING:
                    # (key, value) in (other - self)
                    new[key] = value
                elif existing == value:
                    # (key, value) in (self & other)
                    del new[key]
                else:
                    # values differ: (key, existing) is in (self - other)
                    # and (key, value) is in (other - self); keep the
                    # value from `other`
                    new[key] = value
            return new.items()
        else:
            return set(self) ^ set(other)

I believe this wouldn't require any re-hashing. It would also allow for non-hashable values.
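
For what it's worth, calling that sketch as a plain function on the dicts from msg370839 would behave roughly as follows (built-in dict_items cannot actually be patched, and the .mapping attribute only exists in Python 3.10+, so this is only an illustration of the proposed behaviour):

    left = {1: -1, 2: -2, 3: -3, 4: -4, 5: -5, 6: -6, 7: -7}
    right = {1: -1, 2: -2, 3: -3, 4: -4, 5: -5, 8: -8, 9: -9}
    result = __xor__(left.items(), right.items())
    # result is dict_items([(6, -6), (7, -7), (8, -8), (9, -9)]):
    # a view on a freshly built dict rather than the set returned today,
    # and no (key, value) tuples were hashed along the way.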
msg370957 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-06-08 04:52
> What about returning another dict_items instead of a set?

That API ship has already sailed.  The published API guarantees that a set is returned — there is no changing that.

The question is whether we care enough to provide an optimized implementation that doesn't call __hash__() when the hash value has already been computed and stored.
msg370976 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python triager) Date: 2020-06-08 12:11
PR 20718 helps somewhat by only creating and hashing the tuples that wind up in the final set. Here's a benchmark:

-m pyperf timeit -s "d1 = {i:i for i in range(100_000)}; d2 = {i:i|1 for i in range(100_000)}" "d1.items() ^ d2.items()"

      Master: 37.9 ms +- 0.4 ms
    PR 20718: 25.5 ms +- 0.4 ms
msg371048 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python triager) Date: 2020-06-08 22:48
A demo:

>>> class Int(int):
...     hash_calls = 0
...     def __hash__(self):
...         Int.hash_calls += 1
...         return super().__hash__()
...
>>> left = {Int(1): -1, Int(2): -2, Int(3): -3, Int(4): -4, Int(5): -5, Int(6): -6, Int(7): -7}
>>> right = {Int(1): -1, Int(2): -2, Int(3): -3, Int(4): -4, Int(5): -5, Int(8): -8, Int(9): -9}
>>> Int.hash_calls = 0
>>> left.items() ^ right.items()
{(9, -9), (7, -7), (8, -8), (6, -6)}
>>> Int.hash_calls
14        # on master; only 4 with PR 20718 applied


It looks like the same trick (searching by key and comparing values before maybe constructing a 2-tuple) might give similar performance improvements for dict_items.__or__, dict_items.__and__, and dict_items.__sub__. Is it worth discussing these other operators in this issue?
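
For example, a rough pure-Python sketch of the same trick applied to intersection could look like this (the helper name is invented; the real change would live in the C implementation of the dict view operators):

    def items_intersection(d1, d2):
        # d1.items() & d2.items(): search by key, compare the values,
        # and only then construct and hash the (key, value) tuple.
        if len(d2) < len(d1):
            d1, d2 = d2, d1          # iterate over the smaller dict
        result = set()
        for key, value in d1.items():
            try:
                other = d2[key]      # in C this lookup can reuse the stored key hash
            except KeyError:
                continue
            if other == value:
                result.add((key, value))
        return result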
msg371082 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-06-09 08:17
So this requires adding 100 lines of C code to speed up a pretty uncommon operation for arguments of a particular type.
msg371139 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-06-09 20:31
To my eyes, the patch looks somewhat tame.  It is readable enough and doesn't do anything tricky.

The set object implementation aimed to never recompute the hash when it was already known.  It is reasonable that other set-like operations share that aspiration.
msg371161 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-06-10 05:57
New changeset 07d81128124f2b574808e33267c38b104b42ae2a by Dennis Sweeney in branch 'master':
bpo-40889: Optimize dict.items() ^ dict.items() (GH-20718)
https://github.com/python/cpython/commit/07d81128124f2b574808e33267c38b104b42ae2a
msg371167 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-06-10 06:50
I don't like this tendency of optimizing very uncommon cases. We can optimize every operation for specific types by inlining the code, but it increases the maintenance cost and can have a negative net effect on performance, because it increases the code size and adds additional runtime checks. In the past we usually rejected this kind of patch.
History
Date                 User              Action  Args
2020-06-10 06:50:59  serhiy.storchaka  set     messages: + msg371167
2020-06-10 05:57:30  methane           set     status: open -> closed
                                               nosy: - methane
                                               resolution: fixed
                                               stage: patch review -> resolved
2020-06-10 05:57:07  methane           set     nosy: + methane
                                               messages: + msg371161
2020-06-09 20:31:50  rhettinger        set     messages: + msg371139
2020-06-09 08:17:52  serhiy.storchaka  set     messages: + msg371082
2020-06-08 22:48:32  Dennis Sweeney    set     messages: + msg371048
2020-06-08 12:11:57  Dennis Sweeney    set     messages: + msg370976
2020-06-08 12:05:52  Dennis Sweeney    set     keywords: + patch
                                               stage: patch review
                                               pull_requests: + pull_request19928
2020-06-08 04:52:44  rhettinger        set     messages: + msg370957
2020-06-08 04:36:28  Dennis Sweeney    set     messages: + msg370956
2020-06-08 04:19:17  rhettinger        set     messages: + msg370954
2020-06-07 23:41:30  serhiy.storchaka  set     messages: + msg370933
2020-06-07 22:39:12  rhettinger        set     nosy: + Dennis Sweeney
2020-06-06 17:09:59  rhettinger        create