Issue 21234: __contains__ and friends should check "is" for all elements first

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/65433

classification

Title:	__contains__ and friends should check "is" for all elements first
Type:	performance	Stage:
Components:	Interpreter Core	Versions:	Python 3.5

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	jcea, jneb, rhettinger, wolma
Priority:	normal	Keywords:

Created on 2014-04-15 13:04 by jneb, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
containsDemo.py	jneb, 2014-04-15 13:04	Demo of container lookup speed

Messages (8)
msg216286 - (view)	Author: Jurjen N.E. Bos (jneb) *	Date: 2014-04-15 13:04
It all started when adding an __equals__ method to a self made class. The effect was that my program slowed down enormously, which came as a surprise. This is because when doing an operation that does linear search through a container, the interpreter checks all items one by one, first checking identity and then equality. If the __equals__ method is slow, this is slower than needed. In python, you effectively get: class myContainer: def __contains__(self, obj): for i in range(len(self)): if self[i] is obj: return True if self[i]==obj: return True return False My proposal is to change this to: class myContainer: def __contains__(self, obj): for i in range(len(self)): if self[i] is obj: return True for i in range(len(self)): if self[i]==obj: return True return False The net effect would be approximately: - if the object is exactly in the container, the speedup is significant. - if the object is not in the container, there is no difference. - if an object is in the container that is equal to the object, it will be slower, but not very much. In the normal use case, this will probably feel to the user as a speed improvement, especially when the __equals__ is slow. I tried to find cases in which this would change behaviour of programs, but I couldn't find any. If this _does_ change behaviour in some noticeable and unwanted way, let me know! (Documenting would also be a good idea, then.) The accompanying file gives some timings to show what I mean, e.g. Time in us to find an exact match (begin, middle, end): 0.042335559340708886 31.610660936887758 62.69573781716389 Time in us to find an equal match (begin, middle, end): 0.3730294263293299 31.421928646805195 63.177373531221896 Time if not found: 63.44531546956001 And now for an object that has no __equal__: Time in us to find a thing (begin, middle, end): 0.03555453901338268 9.878883646121661 19.656711762284473 Time if not found: 19.676395048315776 (Python 2.7 does something completely different with objects without __equal__, so the test gives quite different results.)
msg216336 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-04-15 16:44
>> - if an object is in the container that is equal to the object, it will be slower, but not very much. You don't know that in general. It depends on where in the sequence the equal object sits, and also on how expensive the equality check is compared to the identity check. A simplified example over your rather lengthy one: >>> l=list(range(20000000)) >>> any(e is 9000000 or e == 9000000 for e in l) # mimicking current behaviour True >>> any(e is 9000000 for e in l) or any(e == 9000000 for e in l) # your suggestion True The second example takes about twice as long as the first one because the identity check has to be run on the whole list, when the equality check discovers the match half way through the sequence. Given that equality is usually more likely than identity, I guess, you would pay a price for your suggestion more often than you profit from it.
msg216362 - (view)	Author: Jurjen N.E. Bos (jneb) *	Date: 2014-04-15 19:00
Well, I partially agree. I see the following points: Against my proposal: - For very big containers, it can be slower in the case the object is early in the container, as you pointed out. - Current behaviour is easier to understand. OTOH, fore the proposal: - Such giant containers are not efficient anyway; that's were set/Counter can help. - The documentation doesn't promise anywhere the objects are scanned in order. Anyway, if this is supposed to be the behaviour, I suggest to document it, and add the following recipe for people dealing with the same problem as I had: from operator import ne from itertools import repeat class MyContainer: """container allowing equality search with containsEqual, while allowing fast identity search with __contains__: use "obj in c" to test if obj exactly sits in c use "c.containsEqual(obj)" to test if an object in c has c==obj """ def containsEqual(self, object): return not all(map(ne, zip(repeat(object), self))) def __ne__(self, object): "Your not-equal test" If you see a more elegant equivalent recipe, feel free to add.
msg216453 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-04-16 08:40
???? I don't even know where to start with this. a) this recipe is not working b) it's hardly readable c) it is pointless Why are you complicating things by testing for != ? What advantage does this offer over == ? You do not need class methods at all to achieve what you want (in fact the __ne__ as a method of the container is just wrong), instead use the one-liner: any(element == value for element in container) to find out if any element of your container equals value without doing the identity check, but then: the identity check is anyway the fast part compared to the equality check (at least you assumed that in your first post). and in fact with: >>> l=list(range(20000000)) >>> 20000000 in l False is much faster than: >>> any(e == 20000000 for e in l) False despite checking identity AND equality, simply because it isn't doing things in Python. So while the current docs say this about the in operator: """ For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y). """ I guess, that doesn't mean it is actually implemented like that, but only that the result is equivalent.
msg216457 - (view)	Author: Jurjen N.E. Bos (jneb) *	Date: 2014-04-16 10:30
Oops. That was a hard lesson: 1) don't haste when posting 2) always run what you post. The point was the trick to define a custom __ne__ and not an __eq__ for an object (not for the container it is in!) so you can use "in" at full speed. Then not all(map(ne, repeat(obj), container)) or not all(map(obj.__ne__, container)) can be used if you really what to check for equality. This does make a difference in my case, where I only sometimes check for a non-identical object in the container, and I know when I do that.
msg216463 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-04-16 13:20
that clarifies things, thanks. I would still not usually go that way though as it means defining __ne__ with no accompanying __eq__, which means that, in a simple case, you can't use == on instances of your class and, in the case that your class inherits __eq__ from a parent, that == and != give inconsistent answers. A much simpler solution is to not use the x in y idiom if you know it is slowed down by expensive equality checks in the elements of y and you're only interested in the identity check. Simply replace it with any(element is x for element in y) , which will run at decent speed. A quick illustration: class myObj(object): def __eq__(self, other): for i in range(10000): pass # simulate an expensive check return False l=[myObj() for x in range(10000)] now compare: >>> 1 in m # slowed down by myObj.__eq__ False >>> any(e is 1 for e in m) # identity checks only False => no class-level hacking required, but still a good performance gain. Of course, if you really need bets performance with identity and equality checks, then your solution may make sense, but that looks like a pretty special requirement. (and even then I would replace the ugly not all(map(ne, repeat(obj), container)) # requires 2 imports to be so hard to read with: not all(element != obj for element in container) )
msg216464 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-04-16 13:26
> l=[myObj() for x in range(10000)] > > now compare: > > >>> 1 in m # slowed down by myObj.__eq__ > False > > >>> any(e is 1 for e in m) # identity checks only > False oops, sorry for the inconsistency here. the first line should read: m = [myObj() for x in range(10000)] for this to work, of course.
msg216631 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2014-04-17 01:03
> when doing an operation that does linear search through a container, > the interpreter checks all items one by one, first checking identity > and then equality. This is the guaranteed behavior. Changing it would result in subtle change in semantics (i.e. an equality match early in a sequence would be bypassed in favor of a later identity match). The current behavior also has favorable cache characteristics (i.e. each element is accessed exactly once) which provide benefits for long lists that cannot fit in L1 or L2 cache. That said, I feel your pain. Slow __eq__ tests are the bane of linear search. I recommend using a hashed container for fast searching or that you use a tool like PyPy that gives significant speed-ups.

History
Date	User	Action	Args
2022-04-11 14:58:01	admin	set	github: 65433
2014-04-29 23:43:48	jcea	set	nosy: + jcea
2014-04-17 01:03:51	rhettinger	set	status: open -> closed resolution: rejected messages: + msg216631 versions: - Python 3.1, Python 3.2, Python 3.3, Python 3.4
2014-04-16 16:34:24	rhettinger	set	assignee: rhettinger nosy: + rhettinger
2014-04-16 13:26:08	wolma	set	messages: + msg216464
2014-04-16 13:20:59	wolma	set	messages: + msg216463
2014-04-16 10:30:57	jneb	set	messages: + msg216457
2014-04-16 08:40:24	wolma	set	messages: + msg216453
2014-04-15 19:00:16	jneb	set	messages: + msg216362
2014-04-15 16:44:55	wolma	set	nosy: + wolma messages: + msg216336
2014-04-15 13:04:23	jneb	create