This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: __contains__ and friends should check "is" for all elements first
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: jcea, jneb, rhettinger, wolma
Priority: normal Keywords:

Created on 2014-04-15 13:04 by jneb, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
containsDemo.py jneb, 2014-04-15 13:04 Demo of container lookup speed
Messages (8)
msg216286 - (view) Author: Jurjen N.E. Bos (jneb) * Date: 2014-04-15 13:04
It all started when adding an __equals__ method to a self made class.
The effect was that my program slowed down enormously, which came as a surprise.

This is because when doing an operation that does linear search through a container, the interpreter checks all items one by one, first checking identity and then equality.
If the __equals__ method is slow, this is slower than needed.

In python, you effectively get:
class myContainer:
  def __contains__(self, obj):
    for i in range(len(self)):
      if self[i] is obj: return True
      if self[i]==obj: return True
    return False

My proposal is to change this to:
class myContainer:
  def __contains__(self, obj):
    for i in range(len(self)):
      if self[i] is obj: return True
    for i in range(len(self)):
      if self[i]==obj: return True
    return False

The net effect would be approximately:
- if the object is exactly in the container, the speedup is significant.
- if the object is not in the container, there is no difference.
- if an object is in the container that is equal to the object, it will be slower, but not very much.

In the normal use case, this will probably feel to the user as a speed improvement, especially when the __equals__ is slow.

I tried to find cases in which this would change behaviour of programs, but I couldn't find any. If this _does_ change behaviour in some noticeable and unwanted way, let me know! (Documenting would also be a good idea, then.)

The accompanying file gives some timings to show what I mean, e.g.
Time in us to find an exact match (begin, middle, end):
0.042335559340708886 31.610660936887758 62.69573781716389
Time in us to find an equal match (begin, middle, end):
0.3730294263293299 31.421928646805195 63.177373531221896
Time if not found:
63.44531546956001
And now for an object that has no __equal__:
Time in us to find a thing (begin, middle, end):
0.03555453901338268 9.878883646121661 19.656711762284473
Time if not found:
19.676395048315776

(Python 2.7 does something completely different with objects without __equal__, so the test gives quite different results.)
msg216336 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-04-15 16:44
>> - if an object is in the container that is equal to the object, it will be slower, but not very much.

You don't know that in general. It depends on where in the sequence the equal object sits, and also on how expensive the equality check is compared to the identity check.

A simplified example over your rather lengthy one:

>>> l=list(range(20000000))
>>> any(e is 9000000 or e == 9000000 for e in l) # mimicking current behaviour
True
>>> any(e is 9000000 for e in l) or any(e == 9000000 for e in l) # your suggestion
True

The second example takes about twice as long as the first one because the identity check has to be run on the whole list, when the equality check discovers the match half way through the sequence.
Given that equality is usually more likely than identity, I guess, you would pay a price for your suggestion more often than you profit from it.
msg216362 - (view) Author: Jurjen N.E. Bos (jneb) * Date: 2014-04-15 19:00
Well, I partially agree. I see the following points:
Against my proposal:
- For *very* big containers, it can be slower in the case the object is early in the container, as you pointed out.
- Current behaviour is easier to understand.

OTOH, fore the proposal:
- Such giant containers are not efficient anyway; that's were set/Counter can help.
- The documentation doesn't promise anywhere the objects are scanned in order.

Anyway, if this is supposed to be the behaviour, I suggest to document it, and add the following recipe for people dealing with the same problem as I had:

from operator import ne
from itertools import repeat
class MyContainer:
  """container allowing equality search with containsEqual,
  while allowing fast identity search with __contains__:
  use "obj in c" to test if obj exactly sits in c
  use "c.containsEqual(obj)" to test if an object in c has c==obj
  """
  def containsEqual(self, object):
    return not all(map(ne, zip(repeat(object), self)))

  def __ne__(self, object):
    "Your not-equal test"

If you see a more elegant equivalent recipe, feel free to add.
msg216453 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-04-16 08:40
????

I don't even know where to start with this.

a) this recipe is not working
b) it's hardly readable
c) it is pointless

Why are you complicating things by testing for != ?
What advantage does this offer over == ?

You do not need class methods at all to achieve what you want (in fact the __ne__ as a method of the container is just wrong), instead use the one-liner:

any(element == value for element in container)

to find out if any element of your container equals value without doing the identity check, but then:
the identity check is anyway the fast part compared to the equality check (at least you assumed that in your first post).

and in fact with:
>>> l=list(range(20000000))

>>> 20000000 in l
False

is much faster than:

>>> any(e == 20000000 for e in l)
False

despite checking identity AND equality, simply because it isn't doing things in Python.

So while the current docs say this about the in operator:

"""
For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y).
"""

I guess, that doesn't mean it is actually implemented like that, but only that the result is equivalent.
msg216457 - (view) Author: Jurjen N.E. Bos (jneb) * Date: 2014-04-16 10:30
Oops. That was a hard lesson: 1) don't haste when posting 2) always run what you post.

The point was the trick to define a custom __ne__ and not an __eq__ for an object (not for the container it is in!) so you can use "in" at full speed. Then
not all(map(ne, repeat(obj), container))
or
not all(map(obj.__ne__, container))
can be used if you really what to check for equality. This does make a difference in my case, where I only sometimes check for a non-identical object in the container, and I know when I do that.
msg216463 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-04-16 13:20
that clarifies things, thanks.

I would still not usually go that way though as it means defining __ne__ with no accompanying __eq__, which means that, in a simple case, you can't use == on instances of your class and, in the case that your class inherits __eq__ from a parent, that == and != give inconsistent answers.

A much simpler solution is to not use the x in y idiom if you know it is slowed down by expensive equality checks in the elements of y and you're only interested in the identity check.
Simply replace it with

any(element is x for element in y)

, which will run at decent speed.

A quick illustration:

class myObj(object):
    def __eq__(self, other):
        for i in range(10000): pass # simulate an expensive check
        return False

l=[myObj() for x in range(10000)]

now compare:

>>> 1 in m # slowed down by myObj.__eq__
False

>>> any(e is 1 for e in m) # identity checks only
False

=> no class-level hacking required, but still a good performance gain.
Of course, if you really need bets performance with identity *and* equality checks, then your solution may make sense, but that looks like a pretty special requirement.
(and even then I would replace the ugly

not all(map(ne, repeat(obj), container)) # requires 2 imports to be so hard to read

with:

not all(element != obj for element in container)
)
msg216464 - (view) Author: Wolfgang Maier (wolma) * Date: 2014-04-16 13:26
> l=[myObj() for x in range(10000)]
> 
> now compare:
> 
> >>> 1 in m # slowed down by myObj.__eq__
> False
> 
> >>> any(e is 1 for e in m) # identity checks only
> False

oops, sorry for the inconsistency here.

the first line should read:

m = [myObj() for x in range(10000)]

for this to work, of course.
msg216631 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-04-17 01:03
>  when doing an operation that does linear search through a container,
> the interpreter checks all items one by one, first checking identity
> and then equality.

This is the guaranteed behavior.  Changing it would result in subtle change in semantics (i.e. an equality match early in a sequence would be bypassed in favor of a later identity match).

The current behavior also has favorable cache characteristics (i.e. each element is accessed exactly once) which provide benefits for long lists that cannot fit in L1 or L2 cache.

That said, I feel your pain.  Slow __eq__ tests are the bane of linear search.  I recommend using a hashed container for fast searching or that you use a tool like PyPy that gives significant speed-ups.
History
Date User Action Args
2022-04-11 14:58:01adminsetgithub: 65433
2014-04-29 23:43:48jceasetnosy: + jcea
2014-04-17 01:03:51rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg216631

versions: - Python 3.1, Python 3.2, Python 3.3, Python 3.4
2014-04-16 16:34:24rhettingersetassignee: rhettinger

nosy: + rhettinger
2014-04-16 13:26:08wolmasetmessages: + msg216464
2014-04-16 13:20:59wolmasetmessages: + msg216463
2014-04-16 10:30:57jnebsetmessages: + msg216457
2014-04-16 08:40:24wolmasetmessages: + msg216453
2014-04-15 19:00:16jnebsetmessages: + msg216362
2014-04-15 16:44:55wolmasetnosy: + wolma
messages: + msg216336
2014-04-15 13:04:23jnebcreate