classification
Title: Misleading membersip expression documentation
Type: behavior Stage:
Components: Documentation Versions: Python 3.10
process
Status: open Resolution: third party
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, eric.araujo, eric.smith, ezio.melotti, harahu, mdk, rhettinger, serhiy.storchaka, willingc
Priority: normal Keywords:

Created on 2021-11-17 19:00 by harahu, last changed 2021-11-20 12:41 by harahu.

Messages (7)
msg406485 - (view) Author: Harald Husum (harahu) Date: 2021-11-17 19:00
https://docs.python.org/3/reference/expressions.html#membership-test-operations

> For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression `x in y` is equivalent to `any(x is e or x == e for e in y)`.

Yet:

```py
import pandas as pd
import numpy as np

pd_0_dt = pd.Timedelta(0)
np_0_dt = np.timedelta64(0)

cm = (pd_0_dt, np_0_dt)

d1 = {np_0_dt: pd_0_dt}
d2 = {pd_0_dt: np_0_dt}

def test_membership_doc_claim(candidate_members, dct):
    for m in candidate_members:
        if m in dct:
            assert any(m is e or m == e for e in dct)

        if any(m is e or m == e for e in dct):
            assert m in dct

if __name__ == "__main__":
    test_membership_doc_claim(cm, d1) # Fails
    test_membership_doc_claim(cm, d2) # Fails

```

Not too surprised, given the td.__hash__() implementation differs between these classes, but they are considered equal none the less.

Unsure whether it is the dict implementation or the doc claim that needs to budge here.
msg406490 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-11-17 20:29
This section presumes that the usual hash invariant holds:  a==b implies hash(a)==hash(b).  We could repeat that here but I don't think it makes the docs better or more useable to require that docs repeat the same facts in multiple places.

Alternatively, the sentence could be split to cover both cases:

"""
For sequence container types such as list, tuple, or collections.deque,
the expression `x in y` is equivalent to `any(x is e or x == e for e in y)`.
For container that use hashing, such as dict, set, or frozenset, 
the expression `x in y` is equivalent to `any(x is e or x == e for e in y if hash(x) == hash(e))`.
"""

While that is more precise, it borders on being pedantic and likely doesn't make the average reader better off.

Consider submitting a feature request to pandas suggesting that they harmonize their hash functions with their counterparts in numpy.
msg406497 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2021-11-17 21:45
I don’t think repeating the hash invariant in multiple places adds anything, I think it would just add clutter. I also think the existing docs are easier to understand than the version with the hashing containers split out.
msg406502 - (view) Author: Harald Husum (harahu) Date: 2021-11-17 22:33
Might i then suggest mentioning in the docs that this assumes the invariance, combined with a backlink to the definition, instead of a full repeat?

I'm sure the hash invariant is well known to you guys, working on python, but I was genuinely surprised by this behaviour.

Not sure I should be ashamed or not, given that my job has revolved around python for the last couple of years.
msg406526 - (view) Author: Harald Husum (harahu) Date: 2021-11-18 09:40
I am realising that me not knowing about the hash invariance is likely a symptom of something I have in common with most Python users, but not with Python maintainers: Having access to a powerful ecosystem, we mostly get our classes from 3rd parties, rather than implement them ourselves. When I do define my own classes, I usually don't have to touch the `__hash__` or `__eq__` implementations, since I am either subclassing, making a plain dataclass, or leaning on `attrs` to help me out. I think it is telling that even the pandas core devs are able to mess this up, and it suggests to me that this invariance isn't emphasised enough.

Here's a go at specifying what I mean with a backlink:

"""
For sequence container types such as list, tuple, or collections.deque,
the expression `x in y` is equivalent to `any(x is e or x == e for e in y)`.
For container that use hashing, such as dict, set, or frozenset, 
the same equivalence holds, assuming the [hash invariance](https://docs.python.org/3/glossary.html#term-hashable).
"""

I just derived this more or less directly from Hettinger's formulation. It could probably be made clearer.

I am realising that this, (famous, it seems), hash invariance isn't defined in isolation anywhere, making it slightly hard to link to. Any better suggestions than the glossary entry for hashable, which has the definition included? To me, it seems that such a fundamental assumption/convention/requirement, that isn't automatically enforced, should be as easy as possible to point to.

In my search for the definition (prompted by Hettinger) i discovered more surprised, by the way.

Surprise 1:
https://docs.python.org/3/library/collections.abc.html?highlight=hashable#collections.abc.Hashable

> ABC for classes that provide the __hash__() method.

Having now discovered the mentioned invariance, I am surprised this isn't explicitly formulated (and implemented? haven't checked) as:

"""
ABC for classes that provide the __hash__() and __eq__() methods.
"""

I also think this docstring deserves a backlink to the invariance definition, given it's importance, and how easy it is to shoot yourself in the foot. The current formulation of this docstring actually reflected what I (naively) assumed it meant to be hashable, suggesting this is the place in the docs I got my understanding of the term from.

Surprise 2:
https://docs.python.org/3/reference/expressions.html?highlight=hashable#value-comparisons

> The `hash()` result should be consistent with equality. Objects that are equal should either have the same hash value, or be marked as unhashable.

I appreciate that this is mentioned in this section (I was hoping to find it). But it feels like a reiteration of the definition of the invariant, and could thus be replaced with a backlink, like suggested above. I'd much rather see the text real estate be used for a motivating statement (you do't want weird behaviour in sets and dicts), and a reminder of the importance of checking the __hash__ implementation if you are modifying the __eq__ implementation, in, say, some subclass.

Surprise 3:
https://docs.python.org/3/reference/datamodel.html#object.__eq__

> See the paragraph on __hash__() for some important notes on creating hashable objects which support custom comparison operations and are usable as dictionary keys.

Another case of the invariance being mentioned (I appreciate it), but in a way where it isn't directly evident that extreme care should be taken when modifying an __eq__ implementation. Perhaps another case where the invariance should be referred to by link, and the text should focus on the consequences of breaking it.

Surprise 4:
https://docs.python.org/3/reference/datamodel.html#object.__hash__

Another definition-in-passing of the invariance:

> The only required property is that objects which compare equal have the same hash value.

Also replaceable by backlink?

There after follows descriptions of some, (in hindsight very important), protection mechanisms.

> User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns an appropriate value such that x == y implies both that x is y and hash(x) == hash(y).

> A class that overrides __eq__() and does not define __hash__() will have its __hash__() implicitly set to None.

But yet again, without some motivating statement for why we care about the invariance, all of this seems, well, surprising and weird.

Surprise 5:
https://docs.python.org/3/library/functions.html#hash

Perhaps another location where a backlink would be in order, although not sure in this case.
msg406649 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-11-20 09:48
I think there is a bug in either numpy.timedelta64.__hash__ or panda.Timedelta.__hash__. Please report it to corresponding libraries.

It is not related to the documentation of membership test. If you have different hashes for equal objects most dict and set operations will be broken.

I suggest to close this issue as "third party issue".
msg406651 - (view) Author: Harald Husum (harahu) Date: 2021-11-20 12:41
Serhiy, thanks for responding.

I agree on this being "third-party" when looking at the original post I made, and I've reported the issue to the relevant party (https://github.com/pandas-dev/pandas/issues/44504). But really, the bigger issue here, in my opinion, is that the equality-hash invariance is poorly documented at best, and that this poor documentation is the likely "root cause" of the third-party issue I described.

If you read further up the thread, you can see that I have some (pretty concrete) proposals for how to improve on the current situation. I'd like the documentation maintainers to consider them, in whole or in part. If you feel it is more appropriate, I can split it out into a new issue, but I wouldn't mind having a conversation about it here.

The Python ecosystem will benefit as a whole if we keep an open mind as to how such (fundamental) issues can creep into commonly used, infrastructure-critical libraries, developed and maintained by excellent engineers.
History
Date User Action Args
2021-11-20 12:41:47harahusetmessages: + msg406651
2021-11-20 09:48:45serhiy.storchakasetresolution: third party

messages: + msg406649
nosy: + serhiy.storchaka
2021-11-20 01:25:56harahusetversions: + Python 3.10, - Python 3.8
2021-11-18 09:40:24harahusetmessages: + msg406526
2021-11-17 22:33:15harahusetmessages: + msg406502
2021-11-17 21:45:03eric.smithsetnosy: + eric.smith
messages: + msg406497
2021-11-17 20:29:30rhettingersetnosy: + rhettinger
messages: + msg406490
2021-11-17 19:00:34harahucreate