classification
Title: Which operand is preferred by set operations? Missing information in the documentation
Type: enhancement Stage:
Components: Documentation Versions:
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Balakrishnan.B, Giacomo.Alzetta, docs@python, r.david.murray, rhettinger, steven.daprano, stutzbach, terry.reedy
Priority: normal Keywords: easy

Created on 2014-03-12 21:55 by Giacomo.Alzetta, last changed 2014-03-15 20:52 by rhettinger. This issue is now closed.

Messages (9)
msg213305 - (view) Author: Giacomo Alzetta (Giacomo.Alzetta) Date: 2014-03-12 21:55
Currently the documentation for set (at: http://docs.python.org/2/library/stdtypes.html#set) does not mention which operand is preferred when performing the usual binary operations.

For example the following sample code doesn't have a defined result according to the documentation:


class MyObj:
  def __init__(self, key, value):
     self.key = key
     self.value = value
  def __key(self):
    return self.key
  def __hash__(self):
    return hash(self.__key())
  def __eq__(self, other):
    return type(self) is type(other) and self.__key() == other.__key()

a = {MyObj(1, 'a')}

b = {MyObj(1, 'b')}

print((a & b).pop().value)  # 'a' or 'b'?


As far as I can tell currently there is no rule about this. Intersection prefers the second operand, while union prefers the first. These are the only operations where this is important since they include common elements.

I believe that this kind of information ought to be included explicitly near such operations. At the very least, if the behaviour should really be somehow undefined, it should be stated there so that people don't rely on such behaviour.

The same should be stated for |= and &=, since the first will not modify common elements while the latter will.


PS: I can't find any resource about the formatting of issues (e.g. if and which syntax is used for code blocks. Sorry for that.
msg213309 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-03-12 22:14
If you take the intersection or union of a set, it shouldn't matter which set the element "came from", by the axioms that apply to sets.  So it if does, that's an application bug, IMO.
msg213312 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-03-12 22:29
Or, to put it another way, which set it comes from is not documented because it is an implementation detail and can not be depended on.

I'm sure someone will correct me if I'm wrong :)
msg213320 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2014-03-12 22:46
If this is undefined, and I think it should be, the docs should explicitly say so. How is this?

The union and intersection operations select elements which appear in both operands, e.g. set([a, b, c]) & set([c, d, e]) returns set([c]). The selection is performed by equality, not object identity, and which specific object is returned (the c object from the first set or the second set) is an implementation detail and cannot be relied on.
msg213324 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-03-12 23:11
Sounds good to me.
msg213360 - (view) Author: Giacomo Alzetta (Giacomo.Alzetta) Date: 2014-03-13 07:55
I asked this because, for example, in Haskell it *is* a well-defined behaviour (see its documentation at: http://hackage.haskell.org/package/containers-0.5.4.0/docs/Data-Set.html): the left operand is preferred by all operations.

In fact, working with Haskell, I also have used such behaviour in the past. For example when writing a simple type-checker it's quite convenient to use such behaviour when handling environments where you want inner blocks to hide outer variables. Not defining such behaviour means that you must re-implement the wheel in order to achieve the correct behaviour.

In any case, this information should be made explicit in the docs, whether we want to make the behaviour defined or not.
msg213495 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-03-13 23:49
> As far as I can tell currently there is no rule about this. 
> Intersection prefers the second operand, while union prefers the first.

The implementation uses the same logic as found in Lib/sets.py where the intersection operator loops over the SMALLER of the two input sets and does membership testing over the LARGER set.

> In any case, this information should be made explicit in the docs

Not really.  Historically, we've resisted the urge to over-specify or to declare something as undefined.  In general, we make affirmative guarantees when they are useful and when we're prepared to make sure the guarantee will always hold (for example, it took a long while before sorts were guaranteed to be stable).

Another consideration is that for most users, additional notes of "behavior xxx is undefined" is confusing, disconcerting, distracting, and rarely helpful.  Language lawyers tend to request this sort of wording out of some sense of completeness, but it doesn't actually make the docs better for users.
msg213602 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-03-14 23:56
Defining equality to ignore the .value attribute (and the id), says that they *do not matter*. Python believes you and the interpreter does what it does. If we had made a 'first or second operand wins' claim, we would not have been able to optimize intersection by starting with the smaller set. I may mis-remember but I believe we changed after someone noticed that something like {1] & set(range(10000)) versus set(range(10000)) & {1} had very different run times.

I agree with Raymond that the docs for intersection and union should be left as is.

I could read the doc for A.intersection_update(B) as saying that it keeps all items in A that also appear in B, with appear defined on the basis of ==, so that the result is a subset of the actual items in A. If that is the intent, it could be made clearer.

The doc A.update(B) (union) come close to saying that all items in A remain and items in B not in A get added, so that the result all the items in A. Again, if that is the intent, it could be made explicit, and people who want to determine which operand wins could use the update functions to do so.
msg213684 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-03-15 20:52
Thanks Terry.  

And yes, your reading of the set.update() docs is correct, "Update the set, adding elements from all others" means that it updates the set by adding the elements from the other sets.

FWIW, getting into details about "which value wins" goes against the core concept of the data structure.  Sets (in a mathematical sense) are about treating elements of an equivalence class as being identical.  We intentionally treat {1, 1.1} as being of length one and equal to {1.1, 1} eventhough the integer 1 and the float 1.0 are actually distinguishable in ways not tested by equality.

This isn't just a concept with sets.  The dict.update() method works similarly as does other container operations that depend on a notion of equivalence that is independent of other distinctive traits (object identity, attached values, type, etc).
History
Date User Action Args
2014-03-15 20:52:22rhettingersetstatus: open -> closed
resolution: not a bug
messages: + msg213684
2014-03-14 23:56:39terry.reedysetnosy: + terry.reedy
messages: + msg213602
2014-03-13 23:49:36rhettingersetmessages: + msg213495
2014-03-13 23:33:13rhettingersetassignee: docs@python -> rhettinger
2014-03-13 07:55:40Giacomo.Alzettasetmessages: + msg213360
2014-03-13 02:54:06benjamin.petersonsetkeywords: + easy
2014-03-13 01:16:10Balakrishnan.Bsetnosy: + Balakrishnan.B
2014-03-12 23:11:33r.david.murraysetmessages: + msg213324
2014-03-12 22:46:20steven.dapranosetnosy: + steven.daprano
messages: + msg213320
2014-03-12 22:29:58r.david.murraysetnosy: + rhettinger
messages: + msg213312
2014-03-12 22:18:36eric.araujosetnosy: + stutzbach
2014-03-12 22:14:12r.david.murraysetnosy: + r.david.murray
messages: + msg213309
2014-03-12 21:55:00Giacomo.Alzettacreate