This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Worst-case behaviour of hash collision with float NaN
Type: performance Stage: resolved
Components: Interpreter Core, Library (Lib) Versions: Python 3.11, Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: congma, cwg, mark.dickinson, miss-islington, realead, rhettinger, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch

Created on 2021-03-11 14:55 by congma, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description
nan_key.py congma, 2021-03-11 14:55 PoC for NaN-collisions and possible fix for worst-case
Pull Requests
URL Status Linked
PR 25493 merged rhettinger, 2021-04-21 03:13
PR 26679 merged serhiy.storchaka, 2021-06-11 18:22
PR 26706 merged miss-islington, 2021-06-13 13:35
PR 26725 merged mark.dickinson, 2021-06-14 18:06
PR 26743 merged mark.dickinson, 2021-06-15 18:51
Messages (40)
msg388508 - (view) Author: Cong Ma (congma) * Date: 2021-03-11 14:55
Summary: CPython hashes all NaN values to 0. This guarantees worst-case behaviour for dict if numerous existing keys are NaN. I think that by hashing NaN using the generic object (or "pointer") hash instead, the worst-case situation can be alleviated without changing the semantics of either dict or float. However, this also requires changes to how complex and Decimal objects hash, and moreover an incompatible change to sys.hash_info. I would like to hear how Python developers think about this matter.

--------

Currently CPython uses the hard-coded macro constant 0 (_PyHASH_NAN, defined in Include/pyhash.h) as the hash value of all floating-point NaNs. The value is part of the sys.hash_info API and is re-used by complex and Decimal in computing their hashes, in accordance with the Python built-in-type documentation. [0]

(The doc [0] specifically says that "[a]ll hashable nans have the same hash value.")

This is normally not a great concern, except for the worst-case performance. The problem is that, since all NaNs hash to the same value yet are guaranteed to compare unequal to any compatible numeric value -- including themselves -- they are guaranteed to collide.

For this reason I'd like to question whether it is a good idea to have all hashable NaNs with the same hash value.

There have been discussions about this around the Web for some time (see [1]). In [1] the demo Python script times the insertion of k distinct NaN keys (as different objects) into a freshly created dict. Since the keys are distinct and guaranteed to collide with each other, the run time of a single lookup/insertion is roughly linear in the existing number of NaN keys. I've recreated the same script with a more modern Python (attached).
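
As a rough editorial sketch of the effect (sizes chosen arbitrarily), the old constant NaN hash can be emulated with a float subclass; every key is distinct, all keys collide, and the dict must still store each one:

```python
import math

class ConstantHashNaN(float):
    """Emulates the pre-fix behaviour: every NaN hashes to the constant 0."""
    def __hash__(self):
        if math.isnan(self):
            return 0  # the old _PyHASH_NAN value
        return super().__hash__()

def insert_nans(n):
    # Each key is a distinct object; NaN != NaN, so no lookup ever "finds"
    # an existing key, and every insertion probes the same bucket chain.
    d = {}
    for _ in range(n):
        d[ConstantHashNaN("nan")] = None
    return d

d = insert_nans(500)
assert len(d) == 500  # all 500 NaNs are stored as separate keys
```

Insertion here is O(n^2) overall because each new key must walk past all previously inserted NaNs; giving NaNs distinct (e.g. pointer-based) hashes removes the collisions.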

I'd suggest a fix for this worst-case behaviour: instead of returning the hash value 0 for all NaNs, use the generic object (pointer) hash for these objects. As a PoC (also in the attached script), it roughly means

```
class myfloat(float):
    def __hash__(self):
        if self != self:  # i.e., math.isnan(self)
            return object.__hash__(self)
        return super().__hash__(self)
```

This will
- keep the current behaviour of dict intact;
- keep the invariant `a == b implies hash(a) == hash(b)` intact, where applicable;
- uphold all the other rules for Python numeric objects listed in [0];
- make hash collisions no more likely than with object() instances (dict lookup time is amortized constant w.r.t. existing number of NaN keys).

However, it will
- not keep the current rule "All hashable nans have the same hash value";
- not be compatible with the current sys.hash_info API (requires the removal of the "nan" attribute from there and documenting the change);
- require congruent modifications in complex and Decimal too.

Additionally, I don't think this will affect module-level NaN "constants" such as math.nan and how they behave. The "NaN constant" has never been a problem to begin with. It's only the *distinct* NaN objects that may cause the worst-case behaviour.

--------

Just for the record I'd also like to clear some outdated info or misconception about NaN keys in Python dicts. It's not true that NaN keys, once inserted, cannot be retrieved (e.g., as claimed in [1][2]). In Python, they can be, if you keep the original key *object* around by keeping a reference to it (or obtaining a new one from the dict by iterating over it). This, I think, is because Python dict compares for object identity before rich-comparing for equality in `lookdict()` in Objects/dictobject.c, so this works for `d = dict()`:

```
d = dict()
f = float("nan")
d[f] = "value"
v = d[f]
```

but this fails with `KeyError`, as it should:

```
d = dict()
d[float("nan")] = "value"
v = d[float("nan")]
```

In this regard the NaN float object behaves exactly like the object() instance as keys -- except for the collisions. That's why I think at least for floats the "object" hash is likely to work. The solution using PRNG [1] (implemented with the Go language) is not necessary for CPython because the live objects are already distinct.
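
A short sketch of that analogy (keys invented here for illustration): the identity-then-equality rule makes a NaN key behave just like an object() key:

```python
k_obj = object()
k_nan = float("nan")
d = {k_obj: "a", k_nan: "b"}

assert d[k_obj] == "a"        # found via the identity check
assert d[k_nan] == "b"        # same: identity, since nan != nan
assert object() not in d      # a fresh object is a different key
assert float("nan") not in d  # and so is a fresh NaN
```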

--------

Links:

[0] https://docs.python.org/3/library/stdtypes.html#hashing-of-numeric-types
[1] https://research.swtch.com/randhash
[2] https://readafterwrite.wordpress.com/2017/03/23/how-to-hash-floating-point-numbers/
msg388511 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-11 17:35
Sigh. When I'm undisputed ruler of the multiverse, I'm going to make "NaN == NaN" return True, IEEE 754 be damned. NaN != NaN is fine(ish) at the level of numerics; the problems start when the consequences of that choice leak into the other parts of the language that care about equality. NaNs just shouldn't be considered "special enough to break the rules" (where the rules here are that the equivalence relation being used as the basis of equality for a general container should actually *be* an equivalence relation - reflexive, symmetric, and transitive).

Anyway, more constructively ...

I agree with the analysis, and the proposed solution seems sound: if we're going to change this, then using the object hash seems like a workable solution. The question is whether we actually do need to change this.

I'm not too worried about backwards compatibility here: if we change this, we're bound to break *someone's* code somewhere, but it's hard to imagine that there's *that* much code out there that makes useful use of the property that hash(nan) == 0. The most likely source of breakage I can think of is in test code that checks that 3rd-party float-like things (NumPy's float64, or gmpy2 floats, or ...) behave like Python's floats.

@Cong Ma: do you have any examples of cases where this has proved, or is likely to prove, a problem in real-world code, or is this purely theoretical at this point?

I'm finding it hard to imagine cases where a developer *intentionally* has a dictionary with several NaN values as keys. (Even a single NaN as a key seems a stretch; general floats as dictionary keys are rare in real-world code that I've encountered.) But it's somewhat easier to imagine situations where one could run into this accidentally. Here's one such:

>>> import collections, numpy as np
>>> x = np.full(100000, np.nan)
>>> c = collections.Counter(x)

That took around 4 minutes of non-ctrl-C-interruptible time on my laptop. (I was actually expecting it not to "work" as a bad example: I was expecting that NaNs realised from a NumPy array would all be the same NaN object, but that's not what happens.) For comparison, this appears instant:

>>> x = np.random.rand(100000)
>>> c = collections.Counter(x)

And it's not a stretch to imagine having a NumPy array representing a masked binary 1024x1024-pixel image (float values of NaN, 0.0 and 1.0) and using Counter to find out how many 0s and 1s there were.
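
A pure-Python analogue of the Counter example above (sizes arbitrary, no NumPy needed): every float("nan") call yields a distinct object, so each NaN becomes its own key. Under the constant NaN hash the loop is quadratic; with a per-object hash it is linear, but the key count is the same either way:

```python
from collections import Counter

nans = [float("nan") for _ in range(2000)]
c = Counter(nans)

assert len(c) == 2000  # 2000 distinct keys, not one "nan" key
assert all(v == 1 for v in c.values())
```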

On the other hand, we've lived with the current behaviour for some decades now without it apparently ever being a real issue.

On balance, I'm +1 for fixing this. It seems like a failure mode that would be rarely encountered, but it's quite unpleasant when it *is* encountered.
msg388512 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-11 17:57
Hmm. On second thoughts, the proposed solution wouldn't actually *help* with the situation I gave: the elements (lazily) realised from the NumPy array are highly likely to all end up with the same address in RAM. :-(

>>> x = np.full(10, np.nan)
>>> for v in x:
...     print(id(v))
...     del v
... 
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
4601757008
msg388513 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-11 17:58
On third thoughts, of course it *would* help, because the Counter is keeping references to the realised NaN values. I think I'll go away now and come back when my brain is working.
msg388514 - (view) Author: Cong Ma (congma) * Date: 2021-03-11 18:57
Thank you @mark.dickinson for the detailed analysis.

In addition to your hypothetical usage examples, I am also trying to understand the implications for user code.

Judging by issues people open on GitHub, like this one: https://github.com/pandas-dev/pandas/issues/28363, yes, apparently people do run into the "NaN as key" problem now and then. (I'm not familiar enough with the poster's real problem in the Pandas setting to comment on that one, but it seems to suggest people do run into "real world" problems that share some common features with this one.) There are also StackOverflow threads like this (https://stackoverflow.com/a/17062985) where people discuss hashing a data table that explicitly uses NaN for missing values. The reason seems to be that "[e]mpirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation." (https://arxiv.org/pdf/0902.2206.pdf)

I cannot imagine whether the proposed change would make life easier for people who really want NaN keys to begin with. However I think it might reduce the exposure to worst-case performances in dict (and maybe set/frozenset), while not hurting Python's own self-consistency.

Maybe there's one thing to consider about future compatibility though... because the proposed fix depends on the current behaviour that floats (and by extension, built-in float-like objects such as Decimal and complex) are not cached, unlike small ints and interned Unicode objects. So when you explicitly construct a NaN by calling float(), you always get a new Python object. Is this guaranteed now on different platforms with different compilers? I'm trying to parse the magic in Include/pymath.h about the definition of macro `Py_NAN`, where it seems to me that for certain configs it could be a `static const union` translation-unit-level constant. Does this affect the code that actually builds a Python object from an underlying NaN? (I apologize if this is a bad question). But if it doesn't and we're guaranteed to really have Python NaN-objects that don't alias if explicitly constructed, is this something unlikely to change in the future?
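
The premise in question is easy to check empirically in current CPython (this is observed behaviour, not a language guarantee): float construction allocates a fresh object each time, and NaNs never compare equal:

```python
import math

a = float("nan")
b = float("nan")

assert a is not b         # float objects are not cached/interned
assert a != b and a != a  # IEEE 754: NaN is unequal even to itself
assert math.isnan(a) and math.isnan(b)
```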

I also wonder if there's a security implication for servers that process user-submitted input, maybe running a float() constructor on some string list, checking exceptions but silently succeeding with "nan": arguably this is not going to be as common a concern as the string-key collision DoS now foiled by hash randomization, but I'm not entirely sure.

On "theoretical interest".. yes theoretical interests can also be "real world" if one teaches CS theory in real world using Python, see https://bugs.python.org/issue43198#msg386849

So yes, I admit we're in an obscure corner of Python here. It's a tricky, maybe-theoretical, annoying, but easy-to-fix issue.
msg388519 - (view) Author: Cong Ma (congma) * Date: 2021-03-11 19:28
Sorry, please ignore my rambling about "float() returning aliased object" -- in that case the problem with hashing doesn't arise.
msg388522 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-11 19:54
> I also wonder if there's security implication for servers that process user-submitted input

Yes, the "malicious actor" scenario is another one to consider. But unlike the string hashing attack, I'm not seeing a realistic way for the nan hash collisions to be used in attacks, and I'm content not to worry about that until someone gives an actual proof of concept. Many of Python's hash functions are fairly predictable (by design!) and there are already lots of other ways to deliberately construct lots of hash collisions with non-string non-float values.
msg388579 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-03-13 01:12
Idea:  We could make this problem go away by making NaN a singleton.
msg388603 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-13 10:34
> We could make this problem go away by making NaN a singleton.

That could work, though we'd annoy anyone who cared about preserving NaN payloads and signs in Python. I don't know if such people exist. Currently the sign is accessible through things like math.copysign, and the payload through struct manipulations (though on most platforms we still don't see the full range of NaNs: signalling NaNs are quietly silenced on my machine).

We'd also want to do some performance checks: the obvious way to do this would be to have an "is_nan" check in PyFloat_FromDouble. I'd *guess* that a typical CPU would do a reasonable job of branch prediction on that check so that it had minimal impact in normal use, but only timings will tell for sure.
msg388604 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-13 11:16
> signalling NaNs are quietly silenced on my machine

I just tested this assertion, and it appears that I was lying: this doesn't seem to be true. I'm almost *sure* it used to be true, and I'm not sure what changed, and when. (FWIW: Python 3.9.2 from macPorts on macOS 10.14.6 + Intel hardware. Earlier versions of Python on the same machine give similar results to those below, so it's not a core Python change.)

Here's an attempt to create an snan Python float:

Python 3.9.2 (default, Feb 28 2021, 13:47:30) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import struct
>>> snan_bits = 0x7ff0000000123456
>>> snan = struct.unpack("<d", struct.pack("<Q", snan_bits))[0]
>>> snan
nan

Converting back shows that the attempt was successful: the payload, including the signalling bit (bit 51, counting with the lsb as bit 0), was preserved:

>>> hex(struct.unpack("<Q", struct.pack("<d", snan))[0])
'0x7ff0000000123456'

As expected, doing any arithmetic on the value converts the signalling nan to the corresponding quiet nan (so that the leading 16 bits become 0x7ff8 instead of 0x7ff0), but preserves the rest of the payload.

>>> hex(struct.unpack("<Q", struct.pack("<d", 0.0 + snan))[0])
'0x7ff8000000123456'
msg388605 - (view) Author: Cong Ma (congma) * Date: 2021-03-13 11:23
> Idea:  We could make this problem go away by making NaN a singleton.

Apart from the fact that NaN is not a specific value/object (as pointed out in the previous message by @mark.dickinson), currently each builtin singleton (None, True, False, etc.) in Python satisfies the following predicate:

`if s is a singleton, then s == s`

This is also satisfied by "flyweight" objects such as small ints.

It feels natural and unsurprising to have this, if not officially promised. Requiring NaN to be a singleton would violate this implicit understanding about singletons, and cause another surprise on top of the other surprises with NaNs (or with rich comparison in general).

Performance-wise, I think the current behaviour (returning _PyHASH_NAN == 0) is already nested inside two layers of conditionals in `_Py_HashDouble()` in Python/pyhash.c:

```
    if (!Py_IS_FINITE(v)) {
        if (Py_IS_INFINITY(v))
            return v > 0 ? _PyHASH_INF : -_PyHASH_INF;
        else
            return _PyHASH_NAN;
    }
```
(v is the underlying C `double`, field `ob_fval` of `PyFloatObject`).

I don't feel performance would be hurt by rewriting `float_hash()` in Objects/floatobject.c to the effect of

```
    if (!Py_IS_NAN(v->ob_fval)) {
        return _Py_HashDouble(v->ob_fval);
    }
    else {
        return _Py_HashPointer(v);
    }
```
and simplify the conditionals in `_Py_HashDouble()`. But of course, only real benchmarking could tell. (Other float-like types would have to be adjusted, too.)
msg388608 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-03-13 11:32
@Cong Ma: Yes, I'm not worried about performance for the change you're proposing (though as you say, we should benchmark anyway). The performance thoughts were motivated by the idea of making NaN a singleton: adding a check to PyFloat_FromDouble would mean that almost every operation that produced a float would have to pass through that check.
msg388702 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-03-15 02:53
> The performance thoughts were motivated by the idea of
> making NaN a singleton: adding a check to 
> PyFloat_FromDouble would mean that almost every operation
> that produced a float would have to pass through that check.

It may suffice to move the check upstream from PyFloat_FromDouble so that float('NaN') always produces identically the same object as math.nan.

That would handle the common cases where NaN is used for missing values or is generated from string conversions.  We don't need a bullet-proof solution, just mitigation of harm.
msg388717 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-03-15 09:36
What about Decimal NaN? Even if make float NaN a singleton, there will be other NaNs.

And making float('nan') return a singleton while 1e1000 * 0 returns a different NaN would cause great confusion.
msg388748 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-03-15 15:49
> And making float('nan') return a singleton
> while 1e1000 * 0 returns a different NaN would cause great confusion.

Not really, it would just be an implementation detail, no different than ints and strings being sometimes unique and sometimes not.  Historically, immutable objects are allowed to be reused when it is convenient for the implementation.

> What about Decimal NaN?

Decimal isn't used as much, so the need is less pressing, but we can do whatever is allowed by the spec.  Presumably, that would be easier than with floats because we control all possible ways to construct Decimals.
msg390698 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-04-10 07:14
Mark, is there any reason hash(float('NaN')) and hash(Decimal('NaN')) have to be a constant?  Since NaNs don't compare equal, the hash homomorphism has no restrictions.  Why not have hash() return the id() like we do for instances of object?

I understand that sys.hash_info.nan would be invalidated, but that was likely useless anyway.
msg390701 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-04-10 07:52
> Why not have hash() return the id() like we do for instances of object?

I think that's fine, and IIUC that's what Cong Ma was proposing. It seems like the least invasive potential fix.

In principle I find the idea of making NaN a singleton rather attractive - the performance hit is likely negligible, and it solves a bunch of subtle NaN-related issues. (For example, there's still a proposal to provide an IEEE 754 total_ordering key so that lists of floats can be ordered sanely in the presence of nans, but even if you use that you don't know whether your nans will end up at the start or the end of the list, and you could potentially have nans in both places; fixing a single nan and its sign bit would fix that.) But there are bound to be some people somewhere making use of the NaN payload and sign bit, however inadvisedly, and Serhiy's right that this couldn't be extended to Decimal, where the sign bit and payload of the NaN are mandated by the standard.
msg390767 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-04-11 04:55
I agree that having NaN hash like the generic object hash (return rotated address) is a fine workaround, although I'm not convinced it addresses a real-world problem ;-) But why not? It might.

But that's for CPython. I'm loath to guarantee anything about this in the language itself. If an implementation wants `__contains__()` tests to treat all NaNs as equal instead, or to consider them equal only if "is" holds, or never to consider them equal, or to resolve equality as bitwise representation equality ... all are fine by me. There's no truly compelling case to be made for any of them, although "never considers them equal" is least "surprising" given the insane requirement that NaNs never compare equal under IEEE rules, and that Python-the-language doesn't formally support different notions of equality in different contexts.
msg390770 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-04-11 05:27
> I'm loath to guarantee anything about this in the language itself. 

There aren't any language guarantees being proposed.  Letting the hash depend on the object id just helps avoid quadratic behavior.  Making float('NaN') a singleton is also perfectly reasonable behavior for an immutable type.  Neither is a guaranteed behavior, just a convenient one.


> I'm not convinced it addresses a real-world problem

Maybe yes, maybe no.  I would hope that NaNs arising from bogus calculations would be rare.  OTOH, they are commonly used for missing values in Pandas where internal dict/set operations abound.  Either way, I would like to close off a trivially easy way to invoke quadratic behavior unexpectedly.
msg391605 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-04-22 15:35
New changeset a07da09ad5bd7d234ccd084a3a0933c290d1b592 by Raymond Hettinger in branch 'master':
bpo-43475:  Fix worst case collision behavior for NaN instances (GH-25493)
https://github.com/python/cpython/commit/a07da09ad5bd7d234ccd084a3a0933c290d1b592
msg391606 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-04-22 15:40
Thanks, Raymond. I'll open something on the NumPy bug tracker, too, since they may want to consider doing something similar for numpy.float64 and friends.
msg391607 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-04-22 15:57
Opened https://github.com/numpy/numpy/issues/18833
msg395643 - (view) Author: (realead) Date: 2021-06-11 15:51
This change makes life harder for people trying to get sane behavior with sets for float-, complex- and tuple-objects containing NaNs. With "sane" meaning that

set([nan, nan, nan]) => {nan}

Until now, one only had to override the equality comparison for these types, but now the hash function needs to be overridden as well (which is more complicated, e.g. for tuple).

----------------------------------------------------------------

On a more abstract level: there are multiple possible equivalence relations. 

The one used right now for floats is Py_EQ and results in float("nan") != float("nan") due to IEEE 754.

In hash-set/dict we would like to use an equivalence relation for which all NaNs would be in the same equivalence class. Maybe a new comparator is called for (like PY_EQ_FOR_HASH_COLLECTION), which would yield float("nan") equivalent to float("nan") and which should be used in hash-set/dict?
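
Lacking such a comparator, the desired equivalence class can be built in user code today with a wrapper key. This is a hypothetical sketch (the class name and shared-hash constant are invented here), not an existing API:

```python
import math

class NaNAwareKey:
    """Wrapper that puts all float NaNs into one equivalence class."""
    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if not isinstance(other, NaNAwareKey):
            return NotImplemented
        a, b = self.value, other.value
        if (isinstance(a, float) and isinstance(b, float)
                and math.isnan(a) and math.isnan(b)):
            return True  # collapse NaNs for container purposes
        return a == b

    def __hash__(self):
        if isinstance(self.value, float) and math.isnan(self.value):
            return 0     # one shared hash for the whole NaN class
        return hash(self.value)

s = {NaNAwareKey(float("nan")), NaNAwareKey(float("nan")), NaNAwareKey(1.0)}
assert len(s) == 2  # the two distinct NaN objects merged into one key
```

The wrapper pays the usual pure-Python overhead on every hash/equality call, which is exactly the performance hit discussed above.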
msg395653 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-06-11 18:21
There is an error in the Python implementation of Decimal.__hash__. It calls super().__hash__(), but the C implementation calls object.__hash__().

Also, the documentation for floating point hash has the same error.
msg395685 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-06-12 12:15
New changeset 9f1c5f6e8af6ba3f659b2aea1e221ac9695828ba by Serhiy Storchaka in branch 'main':
bpo-43475: Fix the Python implementation of hash of Decimal NaN (GH-26679)
https://github.com/python/cpython/commit/9f1c5f6e8af6ba3f659b2aea1e221ac9695828ba
msg395745 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-13 13:02
@realead

> This change makes life harder for people trying to get sane behavior with sets [...]

Yes, code that assumes that all NaNs have the same hash value will need to change. But that doesn't seem unreasonable for objects that are already considered distinct with respect to the containment equivalence relation ("is" followed by "=="). I think this change was the right one to make, and my expectation was that there would be few cases of breakage, and that for those few cases it shouldn't be difficult to fix the breakage. If that's not true (either there are lots of cases of breakage, or the breakage is hard to fix), perhaps we should reconsider. I haven't yet seen evidence that either of those things is true, though.

> Maybe a new comparator is called for [...]

Agreed that that seems like a possible technical solution: Python could add a new special method __container_eq__ which is required to provide (at the least) a reflexive and symmetric relation, and whose default implementation does the same is-followed-by-== check that's already used for list, dict and set containment. For what it's worth, this is close to the approach that Julia takes: they have an "is_equal" function that's mostly the same as the normal equality == but differs for NaNs (and incidentally for signed zeros). But whether this would make sense from a language design perspective is another matter, and it would be a significant language-level change (certainly requiring a PEP). It feels like an enormous conceptual change to introduce to the language just to deal with the annoyance of NaNs. And that really is all it would be good for, unless perhaps it also provides a solution to the problem of NumPy arrays in containers, again caused by NumPy arrays overriding __eq__ in a nonstandard way.
msg395746 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-13 13:03
@Serhiy: can this be closed again? Does GH-26679 need to be backported to the 3.10 branch?
msg395748 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-06-13 13:39
> Maybe a new comparator is called for (like PY_EQ_FOR_HASH_COLLECTION), which would yield float("nan") equivalent to float("nan") and which should be used in hash-set/dict?

It is difficult to do from a technical point of view because all rich comparison implementations currently support only 6 operations. If you call tp_richcompare with something else it will crash. So we would need to add a new slot tp_richcompare_ex and complex logic to call either tp_richcompare_ex or tp_richcompare and correctly handle inheritance. It is too high a bar.

> Does GH-26679 need to be backported to the 3.10 branch?

I thought that the original change was 3.11-only, but it seems it was merged before the split of the 3.10 branch. So PR 26679 needs to be backported. I would have added a NEWS entry if I had known this.
msg395749 - (view) Author: miss-islington (miss-islington) Date: 2021-06-13 14:05
New changeset 128899d8b8d65d86bd9bbea6801e9f36e6f409f2 by Miss Islington (bot) in branch '3.10':
bpo-43475: Fix the Python implementation of hash of Decimal NaN (GH-26679)
https://github.com/python/cpython/commit/128899d8b8d65d86bd9bbea6801e9f36e6f409f2
msg395750 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-06-13 14:07
Does this change need to be mentioned in What's New?
msg395761 - (view) Author: (realead) Date: 2021-06-13 20:31
@mark.dickinson

> ...my expectation was that there would be few cases of breakage, and that for those few cases it shouldn't be difficult to fix the breakage.

This expectation is probably correct.

My issue is only partly on-topic here: if one wants to have all NaNs in one equivalence class (e.g. if used as a key value, for example in pandas), it is almost impossible to do so in a consistent way without taking a performance hit. It seems to be possible for builtin types (even if frozenset won't be pretty), but already something like


class A:
    def __init__(self, a):
        self.a = a
    def __hash__(self):
        return hash(self.a)
    def __eq__(self, other):
        return self.a == other.a

is not easy to handle.

A special comparator for containers would be the ultimate solution, but I see how this could be too much hassle for a corner case.
msg395773 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-06-14 04:40
> If one wants to have all NaNs in one equivalency class
> (e.g. if used as a key-value for example in pandas) it
> is almost impossible to do so in a consistent way 
> without taking a performance hit.

ISTM the performance of the equivalence-class case is far less important than the one we were trying to solve.  Given a choice, we should prefer helping normal unadorned instances rather than giving preference to a subclass that redefines the usual behaviors.  

In CPython, it is a fact of life that overriding builtin behaviors with pure Python code always incurs a performance hit.  Also, in your example, the subclass isn't technically correct because it relies on non-guaranteed implementation details.  It likely isn't even the fastest approach.

The only guaranteed behaviors are that math.isnan(x) reliably detects a NaN and that x!=x when x is a NaN.  Those are the only assured tools in the uphill battle to fight the weird intrinsic nature of NaNs.

So one possible solution is to replace all the NaNs with a canonical placeholder value that doesn't have undesired properties:

    {None if isnan(x) else x for x in arr}

That relies on guaranteed behaviors and is reasonably fast.  IMO that beats trying to reprogram float('NaN') to behave the opposite of how it was designed.
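
The replacement recipe above runs as-is (the example data is invented here for illustration):

```python
from math import isnan, nan

arr = [1.0, nan, float("nan"), 0.0, 1.0]

# Canonicalize: every NaN, whatever its identity, maps to the placeholder None.
canonical = {None if isnan(x) else x for x in arr}

assert canonical == {None, 0.0, 1.0}  # both NaN objects collapsed to None
```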
msg395783 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-14 08:53
> Does this change need to be mentioned in What's New?

Yes, I think so, given that it's a change to documented behavior. It's also something that third party packages (like NumPy) potentially need to be aware of.
msg395892 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-15 18:48
New changeset 1d10bf0bb9409a406c56b0de8870df998637fd0f by Mark Dickinson in branch 'main':
bpo-43475: Add what's new entry for NaN hash changes (GH-26725)
https://github.com/python/cpython/commit/1d10bf0bb9409a406c56b0de8870df998637fd0f
msg395894 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-15 19:13
New changeset 8d0b2ca493e236fcad8709a622c1ac8ad29c372d by Mark Dickinson in branch '3.10':
[3.10] bpo-43475: Add what's new entry for NaN hash changes (GH-26725) (GH-26743)
https://github.com/python/cpython/commit/8d0b2ca493e236fcad8709a622c1ac8ad29c372d
msg395918 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-06-16 11:36
I think this can be closed now.
msg406917 - (view) Author: Christoph Groth (cwg) Date: 2021-11-24 10:38
Hello.  I would like to point out a possible problem that this change to CPython has introduced.

This change looks innocent, but it breaks the unwritten rule that the hash value of a number (or actually any built-in immutable type!) in Python depends only on its value.

Thus, before this change, it was possible to convert a tuple of floats into a numpy array and back into a tuple, and the hash values of both tuples would be equal.  This is no longer the case.

Or, more generally, any hashable tuple could be pickled and unpickled, without changing its hash value.  I could well imagine that this breaks real code in subtle ways.
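A minimal sketch of the pickling point (not part of the original message): the round trip creates a fresh NaN object, which already makes the tuples compare unequal, and under the identity-based NaN hash their hashes generally differ as well.

```python
import pickle

t = (1.5, float("nan"))
u = pickle.loads(pickle.dumps(t))  # round trip creates a new NaN object

assert t == t  # each element matches itself via the identity shortcut
assert t != u  # u's NaN is a distinct object, and NaN != NaN
# With the identity-based NaN hash, hash(t) and hash(u) will also
# (almost always) differ, whereas before the change they were equal.
```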

Likewise, it is now no longer possible to provide a library of number-sequence types whose instances always hash like built-in tuples.  (As "tinyarray", of which I am the author, did.)
msg406925 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-11-24 12:58
@cwg: Yep, we're aware of this. There are no good solutions here - only a mass of constraints, compromises and trade-offs. I think we're already somewhere on the Pareto boundary of the "best we can do" given the constraints. Moving to another point on the boundary doesn't seem worth the code churn.

What concrete action would you propose that the Python core devs take at this point?

> it was possible to convert a tuple of floats into a numpy array and back into a tuple, and the hash values of both tuples would be equal.  This is no longer the case.

Sure, but the problem isn't really with hash; that's just a detail. It lies deeper than that - it's with containment itself:

>>> import numpy as np
>>> import math
>>> x = math.nan
>>> some_list = [1.5, 2.3, x]
>>> x in some_list
True
>>> x in list(np.array(some_list))  # expect True, get False
False

The result of the change linked to this PR is that the hash now also reflects that containment depends on object identity, not just object value. Reverting the change would solve the superficial hash problem, but not the underlying containment problem, and would re-introduce the performance issue that was fixed here.
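The identity-versus-value distinction in containment can be reproduced without NumPy; a minimal sketch:

```python
# Containment checks identity first, then equality.  Because IEEE 754
# NaN compares unequal to itself, only the identical NaN object is
# found; an equal-looking NaN from elsewhere is not.
a = float("nan")
b = float("nan")  # a distinct NaN object

lst = [a]
assert a != a        # NaN is unequal even to itself
assert a in lst      # found via the identity shortcut
assert b not in lst  # different object, and b != a, so not found
```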
msg406927 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2021-11-24 13:04
Just for fun: I gave a somewhat ranty 10-minute talk on this topic at a (virtual) conference a few months ago: https://www.youtube.com/watch?v=01oeosRVwgY
msg406928 - (view) Author: Christoph Groth (cwg) Date: 2021-11-24 13:46
> What concrete action would you propose that the Python core devs take at this point?

Nothing for now.

I stumbled across this issue through https://gitlab.kwant-project.org/kwant/tinyarray/-/issues/20 and had the impression that the aspect that I raised (that, for example, hash values of immutable built-in objects now no longer survive pickling) was not examined in this otherwise in-depth discussion.  So I added it for reference.

If problems come up that are caused by this change, I would consider reverting it a possible solution.

> The result of the change linked to this PR is that the hash now also reflects that containment depends on object identity, not just object value.

This is a nice way to phrase it.  Thanks for the link to the entertaining talk. :-)
History
Date User Action Args
2022-04-11 14:59:42  admin             set  github: 87641
2021-11-24 13:46:50  cwg               set  messages: + msg406928
2021-11-24 13:04:30  mark.dickinson    set  messages: + msg406927
2021-11-24 12:58:33  mark.dickinson    set  messages: + msg406925
2021-11-24 10:38:53  cwg               set  nosy: + cwg; messages: + msg406917
2021-06-16 11:36:54  mark.dickinson    set  status: open -> closed; messages: + msg395918; stage: patch review -> resolved
2021-06-15 19:13:17  mark.dickinson    set  messages: + msg395894
2021-06-15 18:51:36  mark.dickinson    set  pull_requests: + pull_request25328
2021-06-15 18:48:44  mark.dickinson    set  messages: + msg395892
2021-06-14 18:06:34  mark.dickinson    set  pull_requests: + pull_request25315
2021-06-14 08:53:56  mark.dickinson    set  messages: + msg395783
2021-06-14 04:40:26  rhettinger        set  messages: + msg395773
2021-06-13 20:31:33  realead           set  messages: + msg395761
2021-06-13 14:07:05  serhiy.storchaka  set  messages: + msg395750
2021-06-13 14:05:42  miss-islington    set  messages: + msg395749
2021-06-13 13:39:12  serhiy.storchaka  set  messages: + msg395748; versions: + Python 3.10
2021-06-13 13:35:32  miss-islington    set  nosy: + miss-islington; pull_requests: + pull_request25292
2021-06-13 13:03:39  mark.dickinson    set  messages: + msg395746
2021-06-13 13:02:29  mark.dickinson    set  messages: + msg395745
2021-06-12 12:15:28  serhiy.storchaka  set  messages: + msg395685
2021-06-11 18:22:19  serhiy.storchaka  set  stage: patch review; pull_requests: + pull_request25265
2021-06-11 18:21:29  serhiy.storchaka  set  status: closed -> open; stage: resolved -> (no value); messages: + msg395653; versions: + Python 3.11
2021-06-11 15:51:46  realead           set  nosy: + realead; messages: + msg395643
2021-04-22 15:57:30  mark.dickinson    set  messages: + msg391607
2021-04-22 15:40:22  mark.dickinson    set  messages: + msg391606
2021-04-22 15:35:40  rhettinger        set  status: open -> closed; resolution: fixed; stage: patch review -> resolved
2021-04-22 15:35:20  rhettinger        set  messages: + msg391605
2021-04-21 03:13:43  rhettinger        set  keywords: + patch; stage: patch review; pull_requests: + pull_request24216
2021-04-11 05:27:17  rhettinger        set  messages: + msg390770
2021-04-11 04:55:50  tim.peters        set  messages: + msg390767
2021-04-11 00:31:57  rhettinger        set  assignee: rhettinger
2021-04-10 07:52:32  mark.dickinson    set  messages: + msg390701
2021-04-10 07:14:42  rhettinger        set  messages: + msg390698
2021-03-15 15:49:57  rhettinger        set  messages: + msg388748
2021-03-15 09:36:44  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg388717
2021-03-15 02:53:09  rhettinger        set  messages: + msg388702
2021-03-13 11:32:56  mark.dickinson    set  messages: + msg388608
2021-03-13 11:23:46  congma            set  messages: + msg388605
2021-03-13 11:16:35  mark.dickinson    set  messages: + msg388604
2021-03-13 10:34:54  mark.dickinson    set  messages: + msg388603
2021-03-13 01:12:22  rhettinger        set  messages: + msg388579
2021-03-11 19:54:31  mark.dickinson    set  messages: + msg388522
2021-03-11 19:28:24  congma            set  messages: + msg388519
2021-03-11 18:57:41  congma            set  messages: + msg388514
2021-03-11 17:58:01  mark.dickinson    set  messages: + msg388513
2021-03-11 17:57:11  mark.dickinson    set  messages: + msg388512
2021-03-11 17:35:41  mark.dickinson    set  nosy: + tim.peters, rhettinger; messages: + msg388511
2021-03-11 15:56:35  mark.dickinson    set  nosy: + mark.dickinson
2021-03-11 14:55:33  congma            create