Message 388514 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	congma
Recipients	congma, mark.dickinson, rhettinger, tim.peters
Date	2021-03-11.18:57:41
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1615489061.92.0.390808874253.issue43475@roundup.psfhosted.org>
In-reply-to

Content
Thank you @mark.dickinson for the detailed analysis. In addition to your hypothetical usage examples, I am also trying to understand the implications for user code. If judging by the issues people open on GitHub like this: https://github.com/pandas-dev/pandas/issues/28363 yes apparently people do run into the "NaN as key" problem every now and then, I guess. (I'm not familiar enough with the poster's real problem in the Pandas setting to comment about that one, but it seems to suggest people do run into "real world" problems that shares some common features with this one). There's also StackOverflow threads like this (https://stackoverflow.com/a/17062985) where people discuss hashing a data table that explicitly use NaN for missing values. The reason seems to be that "[e]mpirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation." (https://arxiv.org/pdf/0902.2206.pdf). I cannot imagine whether the proposed change would make life easier for people who really want NaN keys to begin with. However I think it might reduce the exposure to worst-case performances in dict (and maybe set/frozenset), while not hurting Python's own self-consistency. Maybe there's one thing to consider about future compatibility though... because the proposed fix depends on the current behaviour that floats (and by extension, built-in float-like objects such as Decimal and complex) are not cached, unlike small ints and interned Unicode objects. So when you explicitly construct a NaN by calling float(), you always get a new Python object. Is this guaranteed now on different platforms with different compilers? I'm trying to parse the magic in Include/pymath.h about the definition of macro `Py_NAN`, where it seems to me that for certain configs it could be a `static const union` translation-unit-level constant. Does this affect the code that actually builds a Python object from an underlying NaN? (I apologize if this is a bad question). But if it doesn't and we're guaranteed to really have Python NaN-objects that don't alias if explicitly constructed, is this something unlikely to change in the future? I also wonder if there's security implication for servers that process user-submitted input, maybe running a float() constructor on some string list, checking exceptions but silently succeeding with "nan": arguably this is not going to be as common an concern as the string-key collision DoS now foiled by hash randomization, but I'm not entirely sure. On "theoretical interest".. yes theoretical interests can also be "real world" if one teaches CS theory in real world using Python, see https://bugs.python.org/issue43198#msg386849 So yes, I admit we're in an obscure corner of Python here. It's a tricky, maybe-theoretical, but seemingly annoying but easy-to-fix issue..

Thank you @mark.dickinson for the detailed analysis.

In addition to your hypothetical usage examples, I am also trying to understand the implications for user code.

If judging by the issues people open on GitHub like this: https://github.com/pandas-dev/pandas/issues/28363 yes apparently people do run into the "NaN as key" problem every now and then, I guess. (I'm not familiar enough with the poster's real problem in the Pandas setting to comment about that one, but it seems to suggest people do run into "real world" problems that shares some common features with this one). There's also StackOverflow threads like this (https://stackoverflow.com/a/17062985) where people discuss hashing a data table that explicitly use NaN for missing values. The reason seems to be that "[e]mpirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation." (https://arxiv.org/pdf/0902.2206.pdf).

I cannot imagine whether the proposed change would make life easier for people who really want NaN keys to begin with. However I think it might reduce the exposure to worst-case performances in dict (and maybe set/frozenset), while not hurting Python's own self-consistency.

Maybe there's one thing to consider about future compatibility though... because the proposed fix depends on the current behaviour that floats (and by extension, built-in float-like objects such as Decimal and complex) are not cached, unlike small ints and interned Unicode objects. So when you explicitly construct a NaN by calling float(), you always get a new Python object. Is this guaranteed now on different platforms with different compilers? I'm trying to parse the magic in Include/pymath.h about the definition of macro `Py_NAN`, where it seems to me that for certain configs it could be a `static const union` translation-unit-level constant. Does this affect the code that actually builds a Python object from an underlying NaN? (I apologize if this is a bad question). But if it doesn't and we're guaranteed to really have Python NaN-objects that don't alias if explicitly constructed, is this something unlikely to change in the future?

I also wonder if there's security implication for servers that process user-submitted input, maybe running a float() constructor on some string list, checking exceptions but silently succeeding with "nan": arguably this is not going to be as common an concern as the string-key collision DoS now foiled by hash randomization, but I'm not entirely sure.

On "theoretical interest".. yes theoretical interests can also be "real world" if one teaches CS theory in real world using Python, see https://bugs.python.org/issue43198#msg386849

So yes, I admit we're in an obscure corner of Python here. It's a tricky, maybe-theoretical, but seemingly annoying but easy-to-fix issue..

History
Date	User	Action	Args
2021-03-11 18:57:41	congma	set	recipients: + congma, tim.peters, rhettinger, mark.dickinson
2021-03-11 18:57:41	congma	set	messageid: <1615489061.92.0.390808874253.issue43475@roundup.psfhosted.org>
2021-03-11 18:57:41	congma	link	issue43475 messages
2021-03-11 18:57:41	congma	create