This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Worst-case behaviour of hash collision with float NaN
Type: performance Stage: resolved
Components: Interpreter Core, Library (Lib) Versions: Python 3.11, Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: congma, cwg, mark.dickinson, miss-islington, realead, rhettinger, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch

Created on 2021-03-11 14:55 by congma, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description
nan_key.py congma, 2021-03-11 14:55 PoC for NaN-collisions and possible fix for worst-case
Pull Requests
URL Status Linked
PR 25493 merged rhettinger, 2021-04-21 03:13
PR 26679 merged serhiy.storchaka, 2021-06-11 18:22
PR 26706 merged miss-islington, 2021-06-13 13:35
PR 26725 merged mark.dickinson, 2021-06-14 18:06
PR 26743 merged mark.dickinson, 2021-06-15 18:51
Messages (40)
msg388508 - (view) Author: Cong Ma (congma) * Date: 2021-03-11 14:55
Summary: CPython hashes all NaN values to 0. This guarantees worst-case behaviour for dict if numerous existing keys are NaN. I think that by hashing NaN using the generic object (or "pointer") hash instead, the worst-case situation can be alleviated without changing the semantics of either dict or float. However, this also requires changes to how complex and Decimal objects hash, and moreover an incompatible change to sys.hash_info. I would like to hear how Python developers think about this matter.

--------

Currently CPython uses the hard-coded macro constant 0 (_PyHASH_NAN, defined in Include/pyhash.h) as the hash value of all floating-point NaNs. The value is part of the sys.hash_info API and is re-used by complex and Decimal in computing their hashes, in accordance with the Python built-in-type documentation. [0]

(The doc [0] specifically says that "[a]ll hashable nans have the same hash value.")

This is normally not a great concern, except for the worst-case performance. The problem is that, since all NaNs hash to the same value yet are guaranteed to compare unequal to any compatible numeric value -- including themselves -- they are guaranteed to collide.

For this reason I'd like to question whether it is a good idea to have all hashable NaNs with the same hash value.

There have been discussions about this around the Web for some time (see [1]). In [1] the demo Python script times the insertion of k distinct NaN keys (as different objects) into a freshly created dict. Since the keys are distinct and guaranteed to collide with each other, the run time of a single lookup/insertion is roughly linear in the existing number of NaN keys. I've recreated the same script with a more modern Python (attached).
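
As a rough editorial sketch of the effect (sizes chosen arbitrarily), the old constant NaN hash can be emulated with a float subclass; every key is distinct, all keys collide, and the dict must still store each one:

```python
import math

class ConstantHashNaN(float):
    """Emulates the pre-fix behaviour: every NaN hashes to the constant 0."""
    def __hash__(self):
        if math.isnan(self):
            return 0  # the old _PyHASH_NAN value
        return super().__hash__()

def insert_nans(n):
    # Each key is a distinct object; NaN != NaN, so no lookup ever "finds"
    # an existing key, and every insertion probes the same bucket chain.
    d = {}
    for _ in range(n):
        d[ConstantHashNaN("nan")] = None
    return d

d = insert_nans(500)
assert len(d) == 500  # all 500 NaNs are stored as separate keys
```

Insertion here is O(n^2) overall because each new key must walk past all previously inserted NaNs; giving NaNs distinct (e.g. pointer-based) hashes removes the collisions.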

I'd suggest a fix for this worst-case behaviour: instead of returning the hash value 0 for all NaNs, use the generic object (pointer) hash for these objects. As a PoC (also in the attached script), it roughly means

```
class myfloat(float):
    def __hash__(self):
        if self != self:  # i.e., math.isnan(self)
            return object.__hash__(self)
        return super().__hash__(self)
```

This will
- keep the current behaviour of dict intact;
- keep the invariant `a == b implies hash(a) == hash(b)` intact, where applicable;
- uphold all the other rules for Python numeric objects listed in [0];
- make hash collisions no more likely than with object() instances (dict lookup time is amortized constant w.r.t. existing number of NaN keys).

However, it will
- not keep the current rule "All hashable nans have the same hash value";
- not be compatible with the current sys.hash_info API (requires the removal of the "nan" attribute from there and documenting the change);
- require congruent modifications in complex and Decimal too.

Additionally, I don't think this will affect module-level NaN "constants" such as math.nan and how they behave. The "NaN constant" has never been a problem to begin with. It's only the *distinct* NaN objects that may cause the worst-case behaviour.

--------

Just for the record I'd also like to clear some outdated info or misconception about NaN keys in Python dicts. It's not true that NaN keys, once inserted, cannot be retrieved (e.g., as claimed in [1][2]). In Python, they can be, if you keep the original key *object* around by keeping a reference to it (or obtaining a new one from the dict by iterating over it). This, I think, is because Python dict compares for object identity before rich-comparing for equality in `lookdict()` in Objects/dictobject.c, so this works for `d = dict()`:

```
d = dict()
f = float("nan")
d[f] = "value"
v = d[f]
```

but this fails with `KeyError`, as it should:

```
d = dict()
d[float("nan")] = "value"
v = d[float("nan")]
```

In this regard the NaN float object behaves exactly like the object() instance as keys -- except for the collisions. That's why I think at least for floats the "object" hash is likely to work. The solution using PRNG [1] (implemented with the Go language) is not necessary for CPython because the live objects are already distinct.
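
A short sketch of that analogy (keys invented here for illustration): the identity-then-equality rule makes a NaN key behave just like an object() key:

```python
k_obj = object()
k_nan = float("nan")
d = {k_obj: "a", k_nan: "b"}

assert d[k_obj] == "a"        # found via the identity check
assert d[k_nan] == "b"        # same: identity, since nan != nan
assert object() not in d      # a fresh object is a different key
assert float("nan") not in d  # and so is a fresh NaN
```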

--------

Links:

[0] https://docs.python.org/3/library/stdtypes.html#hashing-of-numeric-types
[1] https://research.swtch.com/randhash
[2] https://readafterwrite.wordpress.com/2017/03/23/how-to-hash-floating-point-numbers/
msg388511 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-11 17:35
Sigh. When I'm undisputed ruler of the multiverse, I'm going to make "NaN == NaN" return True, IEEE 754 be damned. NaN != NaN is fine(ish) at the level of numerics; the problems start when the consequences of that choice leak into the other parts of the language that care about equality. NaNs just shouldn't be considered "special enough to break the rules" (where the rules here are that the equivalence relation being used as the basis of equality for a general container should actually *be* an equivalence relation - reflexive, symmetric, and transitive).

Anyway, more constructively ...

I agree with the analysis, and the proposed solution seems sound: if we're going to change this, then using the object hash seems like a workable solution. The question is whether we actually do need to change this.

I'm not too worried about backwards compatibility here: if we change this, we're bound to break *someone's* code somewhere, but it's hard to imagine that there's *that* much code out there that makes useful use of the property that hash(nan) == 0. The most likely source of breakage I can think of is in test code that checks that 3rd-party float-like things (NumPy's float64, or gmpy2 floats, or ...) behave like Python's floats.

@Cong Ma: do you have any examples of cases where this has proved, or is likely to prove, a problem in real-world code, or is this purely theoretical at this point?

I'm finding it hard to imagine cases where a developer *intentionally* has a dictionary with several NaN values as keys. (Even a single NaN as a key seems a stretch; general floats as dictionary keys are rare in real-world code that I've encountered.) But it's somewhat easier to imagine situations where one could run into this accidentally. Here's one such:

>>> import collections, numpy as np
>>> x = np.full(100000, np.nan)
>>> c = collections.Counter(x)

That took around 4 minutes of non-ctrl-C-interruptible time on my laptop. (I was actually expecting it not to "work" as a bad example: I was expecting that NaNs realised from a NumPy array would all be the same NaN object, but that's not what happens.) For comparison, this appears instant:

>>> x = np.random.rand(100000)
>>> c = collections.Counter(x)

And it's not a stretch to imagine having a NumPy array representing a masked binary 1024x1024-pixel image (float values of NaN, 0.0 and 1.0) and using Counter to find out how many 0s and 1s there were.
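
A pure-Python analogue of the Counter example above (sizes arbitrary, no NumPy needed): every float("nan") call yields a distinct object, so each NaN becomes its own key. Under the constant NaN hash the loop is quadratic; with a per-object hash it is linear, but the key count is the same either way:

```python
from collections import Counter

nans = [float("nan") for _ in range(2000)]
c = Counter(nans)

assert len(c) == 2000  # 2000 distinct keys, not one "nan" key
assert all(v == 1 for v in c.values())
```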

On the other hand, we've lived with the current behaviour for some decades now without it apparently ever being a real issue.

On balance, I'm +1 for fixing this. It seems like a failure mode that would be rarely encountered, but it's quite unpleasant when it *is* encountered.
msg388512 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-11 17:57
Hmm. On second thoughts, the proposed solution wouldn't actually *help* with the situation I gave: the elements (lazily) realised from the NumPy array are highly likely to all end up with the same address in RAM. :-(

>>> x = np.full(10, np.nan)
>>> for v in x:
...     print(id(v))
...     del v
... 
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
msg388513 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-11 17:58
On third thoughts, of course it *would* help, because the Counter is keeping references to the realised NaN values. I think I'll go away now and come back when my brain is working.
msg388514 - (view) Author: Cong Ma (congma) * Date: 2021-03-11 18:57
Thank you @mark.dickinson for the detailed analysis.

In addition to your hypothetical usage examples, I am also trying to understand the implications for user code.

Judging by issues people open on GitHub, like this one: https://github.com/pandas-dev/pandas/issues/28363, yes, apparently people do run into the "NaN as key" problem now and then. (I'm not familiar enough with the poster's real problem in the Pandas setting to comment on that one, but it seems to suggest people do run into "real world" problems that share some common features with this one.) There are also StackOverflow threads like this (https://stackoverflow.com/a/17062985) where people discuss hashing a data table that explicitly uses NaN for missing values. The reason seems to be that "[e]mpirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation." (https://arxiv.org/pdf/0902.2206.pdf)

I cannot imagine whether the proposed change would make life easier for people who really want NaN keys to begin with. However I think it might reduce the exposure to worst-case performances in dict (and maybe set/frozenset), while not hurting Python's own self-consistency.

Maybe there's one thing to consider about future compatibility though... because the proposed fix depends on the current behaviour that floats (and by extension, built-in float-like objects such as Decimal and complex) are not cached, unlike small ints and interned Unicode objects. So when you explicitly construct a NaN by calling float(), you always get a new Python object. Is this guaranteed now on different platforms with different compilers? I'm trying to parse the magic in Include/pymath.h about the definition of macro `Py_NAN`, where it seems to me that for certain configs it could be a `static const union` translation-unit-level constant. Does this affect the code that actually builds a Python object from an underlying NaN? (I apologize if this is a bad question). But if it doesn't and we're guaranteed to really have Python NaN-objects that don't alias if explicitly constructed, is this something unlikely to change in the future?
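
The premise in question is easy to check empirically in current CPython (this is observed behaviour, not a language guarantee): float construction allocates a fresh object each time, and NaNs never compare equal:

```python
import math

a = float("nan")
b = float("nan")

assert a is not b         # float objects are not cached/interned
assert a != b and a != a  # IEEE 754: NaN is unequal even to itself
assert math.isnan(a) and math.isnan(b)
```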

I also wonder if there's a security implication for servers that process user-submitted input, maybe running a float() constructor on some string list, checking exceptions but silently succeeding with "nan": arguably this is not going to be as common a concern as the string-key collision DoS now foiled by hash randomization, but I'm not entirely sure.

On "theoretical interest".. yes theoretical interests can also be "real world" if one teaches CS theory in real world using Python, see https://bugs.python.org/issue43198#msg386849

So yes, I admit we're in an obscure corner of Python here. It's a tricky, maybe-theoretical, annoying, but easy-to-fix issue.
msg388519 - (view) Author: Cong Ma (congma) * Date: 2021-03-11 19:28
Sorry, please ignore my rambling about "float() returning aliased object" -- in that case the problem with hashing doesn't arise.
msg388522 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-11 19:54
> I also wonder if there's security implication for servers that process user-submitted input

Yes, the "malicious actor" scenario is another one to consider. But unlike the string hashing attack, I'm not seeing a realistic way for the nan hash collisions to be used in attacks, and I'm content not to worry about that until someone gives an actual proof of concept. Many of Python's hash functions are fairly predictable (by design!) and there are already lots of other ways to deliberately construct lots of hash collisions with non-string non-float values.
msg388579 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-03-13 01:12
Idea:  We could make this problem go away by making NaN a singleton.
msg388603 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-13 10:34
> We could make this problem go away by making NaN a singleton.

That could work, though we'd annoy anyone who cared about preserving NaN payloads and signs in Python. I don't know if such people exist. Currently the sign is accessible through things like math.copysign, and the payload through struct manipulations (though on most platforms we still don't see the full range of NaNs: signalling NaNs are quietly silenced on my machine).

We'd also want to do some performance checks: the obvious way to do this would be to have an "is_nan" check in PyFloat_FromDouble. I'd *guess* that a typical CPU would do a reasonable job of branch prediction on that check so that it had minimal impact in normal use, but only timings will tell for sure.
msg388604 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-13 11:16
> signalling NaNs are quietly silenced on my machine

I just tested this assertion, and it appears that I was lying: this doesn't seem to be true. I'm almost *sure* it used to be true, and I'm not sure what changed, and when. (FWIW: Python 3.9.2 from macPorts on macOS 10.14.6 + Intel hardware. Earlier versions of Python on the same machine give similar results to those below, so it's not a core Python change.)

Here's an attempt to create an snan Python float:

Python 3.9.2 (default, Feb 28 2021, 13:47:30) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import struct
>>> snan_bits = 0x7ff0000000123456
>>> snan = struct.unpack("<d", struct.pack("<Q", snan_bits))[0]
>>> snan
nan

Converting back shows that the attempt was successful: the payload, including the signalling bit (bit 51, counting with the lsb as bit 0), was preserved:

>>> hex(struct.unpack("<Q", struct.pack("<d", snan))[0])
'0x7ff0000000123456'

As expected, doing any arithmetic on the value converts the signalling nan to the corresponding quiet nan (so that the leading 16 bits become 0x7ff8 instead of 0x7ff0), but preserves the rest of the payload.

>>> hex(struct.unpack("<Q", struct.pack("<d", 0.0 + snan))[0])
'0x7ff8000000123456'
msg388605 - (view) Author: Cong Ma (congma) * Date: 2021-03-13 11:23
> Idea:  We could make this problem go away by making NaN a singleton.

Apart from the fact that NaN is not a specific value/object (as pointed out in the previous message by @mark.dickinson), currently each builtin singleton (None, True, False, etc.) in Python satisfies the following predicate:

`if s is a singleton, then s == s`

This is also satisfied by "flyweight" objects such as small ints.

It feels natural and unsurprising to have this, if not officially promised. Requiring NaN to be a singleton would violate this implicit understanding about singletons, and cause another surprise on top of the other surprises with NaNs (or with rich comparison in general).

Performance-wise, I think the current behaviour (returning _PyHASH_NAN == 0) is already nested inside two layers of conditionals in `_Py_HashDouble()` in Python/pyhash.c:

```
    if (!Py_IS_FINITE(v)) {
        if (Py_IS_INFINITY(v))
            return v > 0 ? _PyHASH_INF : -_PyHASH_INF;
        else
            return _PyHASH_NAN;
    }
```
(v is the underlying C `double`, field `ob_fval` of `PyFloatObject`).

I don't feel performance would be hurt by rewriting `float_hash()` in Objects/floatobject.c to the effect of

```
    if (!Py_IS_NAN(v->ob_fval)) {
        return _Py_HashDouble(v->ob_fval);
    }
    else {
        return _Py_HashPointer(v);
    }
```
and simplify the conditionals in `_Py_HashDouble()`. But of course, only real benchmarking could tell. (Other float-like types would have to be adjusted, too.)
msg388608 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-13 11:32
@Cong Ma: Yes, I'm not worried about performance for the change you're proposing (though as you say, we should benchmark anyway). The performance thoughts were motivated by the idea of making NaN a singleton: adding a check to PyFloat_FromDouble would mean that almost every operation that produced a float would have to pass through that check.
msg388702 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-03-15 02:53
> The performance thoughts were motivated by the idea of
> making NaN a singleton: adding a check to 
> PyFloat_FromDouble would mean that almost every operation
> that produced a float would have to pass through that check.

It may suffice to move the check upstream from PyFloat_FromDouble so that float('NaN') always produces identically the same object as math.nan.

That would handle the common cases where NaN is used for missing values or is generated from string conversions.  We don't need a bullet-proof solution, just mitigation of harm.
msg388717 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-03-15 09:36
What about Decimal NaN? Even if make float NaN a singleton, there will be other NaNs.

And making float('nan') return a singleton while 1e1000 * 0 returns a different NaN would cause great confusion.
msg388748 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-03-15 15:49
> And making float('nan') return a singleton
> while 1e1000 * 0 returns a different NaN would cause great confusion.

Not really, it would just be an implementation detail, no different than ints and strings being sometimes unique and sometimes not.  Historically, immutable objects are allowed to be reused when it is convenient for the implementation.

> What about Decimal NaN?

Decimal isn't used as much, so the need is less pressing, but we can do whatever is allowed by the spec.  Presumably, that would be easier than with floats because we control all possible ways to construct Decimals.
msg390698 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-04-10 07:14
Mark, is there any reason hash(float('NaN')) and hash(Decimal('NaN')) have to be a constant?  Since NaNs don't compare equal, the hash homomorphism has no restrictions.  Why not have hash() return the id() like we do for instances of object?

I understand that sys.hash_info.nan would be invalidated, but that was likely useless anyway.
msg390701 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-04-10 07:52
> Why not have hash() return the id() like we do for instances of object?

I think that's fine, and IIUC that's what Cong Ma was proposing. It seems like the least invasive potential fix.

In principle I find the idea of making NaN a singleton rather attractive - the performance hit is likely negligible, and it solves a bunch of subtle NaN-related issues. (For example, there's still a proposal to provide an IEEE 754 total_ordering key so that lists of floats can be ordered sanely in the presence of nans, but even if you use that you don't know whether your nans will end up at the start or the end of the list, and you could potentially have nans in both places; fixing a single nan and its sign bit would fix that.) But there are bound to be some people somewhere making use of the NaN payload and sign bit, however inadvisedly, and Serhiy's right that this couldn't be extended to Decimal, where the sign bit and payload of the NaN are mandated by the standard.
msg390767 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-04-11 04:55
I agree that having NaN hash like the generic object hash (return rotated address) is a fine workaround, although I'm not convinced it addresses a real-world problem ;-) But why not? It might.

But that's for CPython. I'm loath to guarantee anything about this in the language itself. If an implementation wants `__contains__()` tests to treat all NaNs as equal instead, or to consider them equal only if "is" holds, or never to consider them equal, or to resolve equality as bitwise representation equality ... all are fine by me. There's no truly compelling case to be made for any of them, although "never considers them equal" is least "surprising" given the insane requirement that NaNs never compare equal under IEEE rules, and that Python-the-language doesn't formally support different notions of equality in different contexts.
msg390770 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-04-11 05:27
> I'm loath to guarantee anything about this in the language itself. 

There aren't any language guarantees being proposed.  Letting the hash depend on the object id just helps avoid quadratic behavior.  Making float('NaN') a singleton is also perfectly reasonable behavior for an immutable type.  Neither is a guaranteed behavior, just a convenient one.


> I'm not convinced it addresses a real-world problem

Maybe yes, maybe no.  I would hope that NaNs arising from bogus calculations would be rare.  OTOH, they are commonly used for missing values in Pandas where internal dict/set operations abound.  Either way, I would like to close off a trivially easy way to invoke quadratic behavior unexpectedly.
msg391605 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-04-22 15:35
New changeset a07da09ad5bd7d234ccd084a3a0933c290d1b592 by Raymond Hettinger in branch 'master':
bpo-43475:  Fix worst case collision behavior for NaN instances (GH-25493)
https://github.com/python/cpython/commit/a07da09ad5bd7d234ccd084a3a0933c290d1b592
msg391606 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-04-22 15:40
Thanks, Raymond. I'll open something on the NumPy bug tracker, too, since they may want to consider doing something similar for numpy.float64 and friends.
msg391607 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-04-22 15:57
Opened https://github.com/numpy/numpy/issues/18833
msg395643 - (view) Author: (realead) Date: 2021-06-11 15:51
This change makes life harder for people trying to get sane behavior with sets for float-, complex- and tuple-objects containing NaNs. With "sane" meaning that

set([nan, nan, nan]) => {nan}

Until now, one only had to override the equality comparison for these types, but now the hash function needs to be overridden as well (which is more complicated, e.g. for tuple).

----------------------------------------------------------------

On a more abstract level: there are multiple possible equivalence relations. 

The one used right now for floats is Py_EQ and results in float("nan") != float("nan") due to IEEE 754.

In hash-set/dict we would like to use an equivalence relation for which all NaNs would be in the same equivalence class. Maybe a new comparator is called for (like PY_EQ_FOR_HASH_COLLECTION), which would yield float("nan") equivalent to float("nan") and which should be used in hash-set/dict?
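
Lacking such a comparator, the desired equivalence class can be built in user code today with a wrapper key. This is a hypothetical sketch (the class name and shared-hash constant are invented here), not an existing API:

```python
import math

class NaNAwareKey:
    """Wrapper that puts all float NaNs into one equivalence class."""
    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if not isinstance(other, NaNAwareKey):
            return NotImplemented
        a, b = self.value, other.value
        if (isinstance(a, float) and isinstance(b, float)
                and math.isnan(a) and math.isnan(b)):
            return True  # collapse NaNs for container purposes
        return a == b

    def __hash__(self):
        if isinstance(self.value, float) and math.isnan(self.value):
            return 0     # one shared hash for the whole NaN class
        return hash(self.value)

s = {NaNAwareKey(float("nan")), NaNAwareKey(float("nan")), NaNAwareKey(1.0)}
assert len(s) == 2  # the two distinct NaN objects merged into one key
```

The wrapper pays the usual pure-Python overhead on every hash/equality call, which is exactly the performance hit discussed above.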
msg395653 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-06-11 18:21
There is an error in the Python implementation of Decimal.__hash__. It calls super().__hash__(), but the C implementation calls object.__hash__().

Also, the documentation for floating point hash has the same error.
msg395685 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-06-12 12:15
New changeset 9f1c5f6e8af6ba3f659b2aea1e221ac9695828ba by Serhiy Storchaka in branch 'main':
bpo-43475: Fix the Python implementation of hash of Decimal NaN (GH-26679)
https://github.com/python/cpython/commit/9f1c5f6e8af6ba3f659b2aea1e221ac9695828ba
msg395745 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-13 13:02
@realead

> This change makes life harder for people trying to get sane behavior with sets [...]

Yes, code that assumes that all NaNs have the same hash value will need to change. But that doesn't seem unreasonable for objects that are already considered distinct with respect to the containment equivalence relation ("is" followed by "=="). I think this change was the right one to make, and my expectation was that there would be few cases of breakage, and that for those few cases it shouldn't be difficult to fix the breakage. If that's not true (either there are lots of cases of breakage, or the breakage is hard to fix), perhaps we should reconsider. I haven't yet seen evidence that either of those things is true, though.

> Maybe a new comparator is called for [...]

Agreed that that seems like a possible technical solution: Python could add a new special method __container_eq__ which is required to provide (at the least) a reflexive and symmetric relation, and whose default implementation does the same is-followed-by-== check that's already used for list, dict and set containment. For what it's worth, this is close to the approach that Julia takes: they have an "is_equal" function that's mostly the same as the normal equality == but differs for NaNs (and incidentally for signed zeros). But whether this would make sense from a language design perspective is another matter, and it would be a significant language-level change (certainly requiring a PEP). It feels like an enormous conceptual change to introduce to the language just to deal with the annoyance of NaNs. And that really is all it would be good for, unless perhaps it also provides a solution to the problem of NumPy arrays in containers, again caused by NumPy arrays overriding __eq__ in a nonstandard way.
msg395746 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-13 13:03
@Serhiy: can this be closed again? Does GH-26679 need to be backported to the 3.10 branch?
msg395748 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-06-13 13:39
> Maybe a new comparator is called for (like PY_EQ_FOR_HASH_COLLECTION), which would yield float("nan") equivalent to float("nan") and which should be used in hash-set/dict?

It is difficult to do from a technical point of view because all rich comparison implementations currently support only 6 operations. If you call tp_richcompare with something else it will crash. So we would need to add a new slot tp_richcompare_ex and complex logic to call either tp_richcompare_ex or tp_richcompare and correctly handle inheritance. It is too high a bar.

> Does GH-26679 need to be backported to the 3.10 branch?

I thought that the original change was 3.11-only, but it seems it was merged before the split of the 3.10 branch. So PR 26679 needs to be backported. I would have added a NEWS entry if I had known this.
msg395749 - (view) Author: miss-islington (miss-islington) Date: 2021-06-13 14:05
New changeset 128899d8b8d65d86bd9bbea6801e9f36e6f409f2 by Miss Islington (bot) in branch '3.10':
bpo-43475: Fix the Python implementation of hash of Decimal NaN (GH-26679)
https://github.com/python/cpython/commit/128899d8b8d65d86bd9bbea6801e9f36e6f409f2
msg395750 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-06-13 14:07
Does this change need to be mentioned in What's New?
msg395761 - (view) Author: (realead) Date: 2021-06-13 20:31
@mark.dickinson

> ...my expectation was that there would be few cases of breakage, and that for those few cases it shouldn't be difficult to fix the breakage.

This expectation is probably correct.

My issue is only partly on-topic here: if one wants to have all NaNs in one equivalence class (e.g. if used as a key value, for example in pandas), it is almost impossible to do so in a consistent way without taking a performance hit. It seems to be possible for builtin types (even if frozenset won't be pretty), but already something like


class A:
    def __init__(self, a):
        self.a = a
    def __hash__(self):
        return hash(self.a)
    def __eq__(self, other):
        return self.a == other.a

is not easy to handle.

A special comparator for containers would be the ultimate solution, but I see how this could be too much hassle for a corner case.
msg395773 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-06-14 04:40
> If one wants to have all NaNs in one equivalency class
> (e.g. if used as a key-value for example in pandas) it
> is almost impossible to do so in a consistent way 
> without taking a performance hit.

ISTM the performance of the equivalence-class case is far less important than the one we were trying to solve.  Given a choice, we should prefer helping normal unadorned instances rather than giving preference to a subclass that redefines the usual behaviors.  

In CPython, it is a fact of life that overriding builtin behaviors with pure Python code always incurs a performance hit.  Also, in your example, the subclass isn't technically correct because it relies on non-guaranteed implementation details.  It likely isn't even the fastest approach.

The only guaranteed behaviors are that math.isnan(x) reliably detects a NaN and that x!=x when x is a NaN.  Those are the only assured tools in the uphill battle to fight the weird intrinsic nature of NaNs.

So one possible solution is to replace all the NaNs with a canonical placeholder value that doesn't have undesired properties:

    {None if isnan(x) else x for x in arr}

That relies on guaranteed behaviors and is reasonably fast.  IMO that beats trying to reprogram float('NaN') to behave the opposite of how it was designed.
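
The replacement recipe above runs as-is (the example data is invented here for illustration):

```python
from math import isnan, nan

arr = [1.0, nan, float("nan"), 0.0, 1.0]

# Canonicalize: every NaN, whatever its identity, maps to the placeholder None.
canonical = {None if isnan(x) else x for x in arr}

assert canonical == {None, 0.0, 1.0}  # both NaN objects collapsed to None
```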
msg395783 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-14 08:53
> Does this change need to be mentioned in What's New?

Yes, I think so, given that it's a change to documented behavior. It's also something that third party packages (like NumPy) potentially need to be aware of.
msg395892 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-15 18:48
New changeset 1d10bf0bb9409a406c56b0de8870df998637fd0f by Mark Dickinson in branch 'main':
bpo-43475: Add what's new entry for NaN hash changes (GH-26725)
https://github.com/python/cpython/commit/1d10bf0bb9409a406c56b0de8870df998637fd0f
msg395894 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-15 19:13
New changeset 8d0b2ca493e236fcad8709a622c1ac8ad29c372d by Mark Dickinson in branch '3.10':
[3.10] bpo-43475: Add what's new entry for NaN hash changes (GH-26725) (GH-26743)
https://github.com/python/cpython/commit/8d0b2ca493e236fcad8709a622c1ac8ad29c372d
msg395918 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-16 11:36
I think this can be closed now.
msg406917 - (view) Author: Christoph Groth (cwg) Date: 2021-11-24 10:38
Hello.  I would like to point out a possible problem that this change to CPython has introduced.

This change looks innocent, but it breaks the unwritten rule that the hash value of a number (or actually any built-in immutable type!) in Python depends only on its value.

Thus, before this change, it was possible to convert a tuple of floats into a numpy array and back into a tuple, and the hash values of both tuples would be equal.  This is no longer the case.

Or, more generally, any hashable tuple could be pickled and unpickled, without changing its hash value.  I could well imagine that this breaks real code in subtle ways.
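A minimal sketch of the pickling point (not part of the original message): the round trip creates a fresh NaN object, which already makes the tuples compare unequal, and under the identity-based NaN hash their hashes generally differ as well.

```python
import pickle

t = (1.5, float("nan"))
u = pickle.loads(pickle.dumps(t))  # round trip creates a new NaN object

assert t == t  # each element matches itself via the identity shortcut
assert t != u  # u's NaN is a distinct object, and NaN != NaN
# With the identity-based NaN hash, hash(t) and hash(u) will also
# (almost always) differ, whereas before the change they were equal.
```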

Likewise, it is now no longer possible to provide a library of number-sequence types whose instances always hash like built-in tuples.  (As "tinyarray", of which I am the author, did.)
msg406925 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-11-24 12:58
@cwg: Yep, we're aware of this. There are no good solutions here - only a mass of constraints, compromises and trade-offs. I think we're already somewhere on the Pareto boundary of the "best we can do" given the constraints. Moving to another point on the boundary doesn't seem worth the code churn.

What concrete action would you propose that the Python core devs take at this point?

> it was possible to convert a tuple of floats into a numpy array and back into a tuple, and the hash values of both tuples would be equal.  This is no longer the case.

Sure, but the problem isn't really with hash; that's just a detail. It lies deeper than that - it's with containment itself:

>>> import numpy as np
>>> import math
>>> x = math.nan
>>> some_list = [1.5, 2.3, x]
>>> x in some_list
True
>>> x in list(np.array(some_list))  # expect True, get False
False

The result of the change linked to this PR is that the hash now also reflects that containment depends on object identity, not just object value. Reverting the change would solve the superficial hash problem, but not the underlying containment problem, and would re-introduce the performance issue that was fixed here.
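The identity-versus-value distinction in containment can be reproduced without NumPy; a minimal sketch:

```python
# Containment checks identity first, then equality.  Because IEEE 754
# NaN compares unequal to itself, only the identical NaN object is
# found; an equal-looking NaN from elsewhere is not.
a = float("nan")
b = float("nan")  # a distinct NaN object

lst = [a]
assert a != a        # NaN is unequal even to itself
assert a in lst      # found via the identity shortcut
assert b not in lst  # different object, and b != a, so not found
```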
msg406927 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-11-24 13:04
Just for fun: I gave a somewhat ranty 10-minute talk on this topic at a (virtual) conference a few months ago: https://www.youtube.com/watch?v=01oeosRVwgY
msg406928 - (view) Author: Christoph Groth (cwg) Date: 2021-11-24 13:46
> What concrete action would you propose that the Python core devs take at this point?

Nothing for now.

I stumbled across this issue through https://gitlab.kwant-project.org/kwant/tinyarray/-/issues/20 and had the impression that the aspect that I raised (that, for example, hash values of immutable built-in objects now no longer survive pickling) was not examined in this otherwise in-depth discussion.  So I added it for reference.

If problems come up that are caused by this change, I would consider reverting it a possible solution.

> The result of the change linked to this PR is that the hash now also reflects that containment depends on object identity, not just object value.

This is a nice way to phrase it.  Thanks for the link to the entertaining talk. :-)
History
Date User Action Args
2022-04-11 14:59:42  admin             set  github: 87641
2021-11-24 13:46:50  cwg               set  messages: + msg406928
2021-11-24 13:04:30  mark.dickinson    set  messages: + msg406927
2021-11-24 12:58:33  mark.dickinson    set  messages: + msg406925
2021-11-24 10:38:53  cwg               set  nosy: + cwg; messages: + msg406917
2021-06-16 11:36:54  mark.dickinson    set  status: open -> closed; messages: + msg395918; stage: patch review -> resolved
2021-06-15 19:13:17  mark.dickinson    set  messages: + msg395894
2021-06-15 18:51:36  mark.dickinson    set  pull_requests: + pull_request25328
2021-06-15 18:48:44  mark.dickinson    set  messages: + msg395892
2021-06-14 18:06:34  mark.dickinson    set  pull_requests: + pull_request25315
2021-06-14 08:53:56  mark.dickinson    set  messages: + msg395783
2021-06-14 04:40:26  rhettinger        set  messages: + msg395773
2021-06-13 20:31:33  realead           set  messages: + msg395761
2021-06-13 14:07:05  serhiy.storchaka  set  messages: + msg395750
2021-06-13 14:05:42  miss-islington    set  messages: + msg395749
2021-06-13 13:39:12  serhiy.storchaka  set  messages: + msg395748; versions: + Python 3.10
2021-06-13 13:35:32  miss-islington    set  nosy: + miss-islington; pull_requests: + pull_request25292
2021-06-13 13:03:39  mark.dickinson    set  messages: + msg395746
2021-06-13 13:02:29  mark.dickinson    set  messages: + msg395745
2021-06-12 12:15:28  serhiy.storchaka  set  messages: + msg395685
2021-06-11 18:22:19  serhiy.storchaka  set  stage: patch review; pull_requests: + pull_request25265
2021-06-11 18:21:29  serhiy.storchaka  set  status: closed -> open; stage: resolved -> (no value); messages: + msg395653; versions: + Python 3.11
2021-06-11 15:51:46  realead           set  nosy: + realead; messages: + msg395643
2021-04-22 15:57:30  mark.dickinson    set  messages: + msg391607
2021-04-22 15:40:22  mark.dickinson    set  messages: + msg391606
2021-04-22 15:35:40  rhettinger        set  status: open -> closed; resolution: fixed; stage: patch review -> resolved
2021-04-22 15:35:20  rhettinger        set  messages: + msg391605
2021-04-21 03:13:43  rhettinger        set  keywords: + patch; stage: patch review; pull_requests: + pull_request24216
2021-04-11 05:27:17  rhettinger        set  messages: + msg390770
2021-04-11 04:55:50  tim.peters        set  messages: + msg390767
2021-04-11 00:31:57  rhettinger        set  assignee: rhettinger
2021-04-10 07:52:32  mark.dickinson    set  messages: + msg390701
2021-04-10 07:14:42  rhettinger        set  messages: + msg390698
2021-03-15 15:49:57  rhettinger        set  messages: + msg388748
2021-03-15 09:36:44  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg388717
2021-03-15 02:53:09  rhettinger        set  messages: + msg388702
2021-03-13 11:32:56  mark.dickinson    set  messages: + msg388608
2021-03-13 11:23:46  congma            set  messages: + msg388605
2021-03-13 11:16:35  mark.dickinson    set  messages: + msg388604
2021-03-13 10:34:54  mark.dickinson    set  messages: + msg388603
2021-03-13 01:12:22  rhettinger        set  messages: + msg388579
2021-03-11 19:54:31  mark.dickinson    set  messages: + msg388522
2021-03-11 19:28:24  congma            set  messages: + msg388519
2021-03-11 18:57:41  congma            set  messages: + msg388514
2021-03-11 17:58:01  mark.dickinson    set  messages: + msg388513
2021-03-11 17:57:11  mark.dickinson    set  messages: + msg388512
2021-03-11 17:35:41  mark.dickinson    set  nosy: + tim.peters, rhettinger; messages: + msg388511
2021-03-11 15:56:35  mark.dickinson    set  nosy: + mark.dickinson
2021-03-11 14:55:33  congma            create