classification
Title: Better NaN sorting.
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Marco Sulla, brandtbucher, cheryl.sabella, mark.dickinson, rhettinger, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch

Created on 2019-02-23 19:39 by brandtbucher, last changed 2019-12-23 21:00 by Marco Sulla. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 12001 closed brandtbucher, 2019-02-23 19:39
Messages (25)
msg336401 - (view) Author: Brandt Bucher (brandtbucher) * (Python triager) Date: 2019-02-23 19:39
Sorting sequences containing NaN values produces an incompletely sorted result. Further, because of the complexity of the timsort, this incomplete sort often silently produces unintuitive, unstable-seeming results that are extremely sensitive to the ordering of the inputs:

>>> sorted([3, 1, 2, float('nan'), 2.0, 2, 2.0])
[1, 2, 2.0, 2.0, 3, nan, 2]
>>> sorted(reversed([3, 1, 2, float('nan'), 2.0, 2, 2.0]))
[1, 2.0, 2, 2.0, nan, 2, 3]

The patch I have provided addresses these issues, including for lists containing nested lists/tuples with NaN values. Specifically, it stably sorts NaNs to the end of the list with no changes to the timsort itself (just the element-wise comparison functions):

>>> sorted([3, 1, 2, float('nan'), 2.0, 2, 2.0])
[1, 2, 2.0, 2, 2.0, 3, nan]
>>> sorted([[3], [1], [2], [float('nan')], [2.0], [2], [2.0]])
[[1], [2], [2.0], [2], [2.0], [3], [nan]]

It also includes a new regression test for this behavior.

Some other benefits to this patch:

* These changes generally result in a sorting performance improvement across data types. The largest increases here are for nested lists, since we add a new unsafe_list_compare function. Other speed increases are due to safe_object_compare's delegation to unsafe comparison functions for objects of the same type. Specifically, the speed impact (positive is faster, negative is slower) is between:

    * -3% and +3% (10 elements, no PGO)
    * 0% and +4% (10 elements, PGO)
    * 0% and +9% (1000 elements, no PGO)
    * -1% and +9% (1000 elements, PGO)

* The current weird NaN-sorting behavior is not documented, so this is not a breaking change.

* IEEE754 compliance is maintained. The result is still a stable (arguably, more stable), nondecreasing ordering of the original list.
msg336402 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2019-02-23 19:58
This has been brought up in the past, for example, see #12286.
msg336405 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-23 21:14
Does it work for subclasses of float? For other floating point types like numpy.float32?
msg336407 - (view) Author: Brandt Bucher (brandtbucher) * (Python triager) Date: 2019-02-23 21:49
As a design decision, I consciously chose "no". However, it would be straightforward to make the change to support float subclasses:

#define ISNAN(N) (N->ob_type == &PyFloat_Type && Py_IS_NAN(PyFloat_AsDouble(N)))

becomes

#define ISNAN(N) (PyFloat_Check(N) && Py_IS_NAN(PyFloat_AsDouble(N)))
msg336412 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-23 22:42
-1 I am flat opposed to special casing the list.sort() code for the specific case of NaNs.  

This is an incorrect delegation of responsibility.  The sorted objects are responsible for saying how they will sort and whether than is. deterministic.  Note, float('NaN') isn't the only case.  Sets objects only have a partial ordering.

Also note that list.sort() isn't the only tool that compares objects.  There is also min(), max(), heaps, bisect, etc.  For the most part, we've able to keep those all it sync with one another by a clear separation of responsibilities (objects decide how they are compared versus tools that use comparisons).

At the very least, this would need a python-ideas discussion. FWIW, the usual solution to this problem is to strip the NaN values using math.isnan().
msg336418 - (view) Author: Brandt Bucher (brandtbucher) * (Python triager) Date: 2019-02-23 23:45
Thanks for the feedback. I agree with you on the iffy delegation issue. However, this is problem that I feel deserves a fix... the behavior (silently producing garbage results) is just so un-pythonic. It’s been made clear in other issues that a warning isn’t right here, and sorting is guaranteed to use __lt__ (which NaN defines well for use in all other contexts). I honestly don’t see any other way. If that means fixing min, max, and friends too, I think it’s worth it.

Special cases justify special-casing, especially when doing so doesn’t break the rules. NaN isn’t like any other value of any other type; It’s practically defined by its ability to ruin valid comparison operations, even with fellow floats, for which ordering is otherwise well-defined.

For what it’s worth, I don’t know if anybody who’s tried to sort sets. Every Python programmer I know has blindly sorted floats at some point.
msg336421 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-24 02:55
This issue isn't sorting.  The issue is with what NaNs do.  You can propose to "fix" NaNs rather than special casing everything else in Python that compares two values.

A NaN could be given a deterministic sort order relative to other floats (much as None used to compare less than everything else),  That said, you'll find people who argue than NaNs aren't broken and are doing exactly what they're supposed to do.

Please take this to python-ideas.  From my point of view, the current proposal is a non-starter.
msg336432 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-02-24 04:36
If we could roll back the clock, I'd impose a total ordering on floats in Python and supply some other way to spell 754's comparison operators (which users would have to ask for explicitly if they wanted to hassle with the near-useless and too-often-surprising "unordered" outcome).

But, too late.  As is, I don't want to mess up every context that _uses_ comparisons under the covers to make a special case out of NaNs.  Floats are a partially ordered type, and live with the consequences.

Which are approximately none for me ;-)  I rarely have a use for a NaN to begin with, and never in a list I intend to sort.

That said, the one thing that gives me pause is this line from numpy's docs:

   https://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html
    "In numpy versions >= 1.4.0 nan values are sorted to the end."

But that's not the end of it:

    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
    ...
    na_position : {‘first’, ‘last’}, default ‘last’
    first puts NaNs at the beginning, last puts NaNs at the end

So people who go down this path can't get two steps before making a fork in the road ;-)
msg336457 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-02-24 10:37
Agreed with Raymond and Tim here that the sorting functionality itself shouldn't be special-cased for nans.

[Tim]
> So people who go down this path can't get two steps before making a fork in the road ;-)

Well, both those solutions are wrong. *Clearly*, negative NaNs should be sorted to the start of the list, while positive NaNs are sorted to the end of the list. (Yes, I'm joking, but only a little bit: that's the result you'd get if you used IEEE 754's totalOrder to sort. But it's difficult to see how it would be useful in practice, given that people usually don't know about or care about the sign that their NaN has.)

Maybe the solution would be to provide an official "math.total_order" function that can be used as a key:

    sorted(my_list_of_floats_with_nans_in_it, key=math.total_order)
msg336472 - (view) Author: Brandt Bucher (brandtbucher) * (Python triager) Date: 2019-02-24 14:22
One other idea I had considered was having a new magic method intended to be used for elementwise comparisons such as these (I’m thinking specifically of the old __cmp__/tp_compare). Sorting routines could check for this method on each element, but fall back on __lt__ if it is missing or NotImplemented. 

This would allow us to define an ordering for types independent of the rich comparison behavior. Simply defining it for floats and adding quick patches for list.sort/min/max/... would provide the language (and users) with that flexibility.
msg336487 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-02-24 22:58
> *Clearly*, negative NaNs should be sorted to the start
> of the list, while positive NaNs are sorted to the end
> of the list. (Yes, I'm joking, but only a little bit: that's
> the result you'd get if you used IEEE 754's totalOrder
> to sort.

So not a joke at all!  If a revision of the standard mandates a specific total ordering, it's certain that people will eventually ask for it be supported.

> But it's difficult to see how it would be useful in
> practice, given that people usually don't know about
> or care about the sign that their NaN has.)

The standard "wastes" an enormous number of bit patterns on NaNs, catering to some hypothetical system that uses those bits to encode diagnostic information (e.g., maybe the source code line number of the operation that produced the NaN), and this is just one more "wasted" bit.  If any implementation actually used any of these possibilities, I'm sure semi-plausible ;-) use cases would follow.

I bet this was driven by implementation - it's _almost_ what you'd get if you viewed the float bit patterns as signed integers.  Seems the only flaw with that is that the negative floats are in "reverse order" then.  Which can be repaired easily enough; e.g.,

    import struct
    def totkey(x, *, MASK=(1 << 63) - 1, pack=struct.pack):
        asint = int.from_bytes(pack(">d", x), "big", signed=True)
        if asint < 0:
            asint ^= MASK
        return asint

Then

    xs = [2.0, 1.0, 0.0, -0.0, -1.0, -2.0,
          math.inf, -math.inf, math.nan, -math.nan]
    print(sorted(xs, key=totkey))

displays

    [nan, -inf, -2.0, -1.0, -0.0, 0.0, 1.0, 2.0, inf, nan]

That makes negative NaNs smallest, positive NaNs largest, puts -0 before +0, and in all other respects I checked ;-) appears to match the totalOrder requirements.  On most boxes, it will even consider (positive) signaling NaNs to "be smaller" than (positive) quiet NaNs, which totalOrder also seems to require (presumably again not because it's obviously useful, but because it _follows_ from a very simple implementation like the one above).
msg358440 - (view) Author: Marco Sulla (Marco Sulla) Date: 2019-12-15 18:46
I'm in favor of a `math.total_ordering` function, but IMHO sorting functions should emit a warning if they contains an unorderable objects.

This can be done easily: I suppose the sorting function checks if the objects of the iterable are minor that another object. And soon or later, this check will be done for all objects in the iterable.

What I propose is simply to add a flag `all_orderables`, true by default.

After the current object a is checked against the object b, if `all_orderables` is true, another check will be performed: `a >= b`?

If this return false, `all_orderables` will be set to false.

At the end of sorting, if `all_orderables` is false, a warning will be emitted.

This way, no old code will break, but developers will be informed of the problem. Because "Errors should never pass silently" :)
msg358441 - (view) Author: Marco Sulla (Marco Sulla) Date: 2019-12-15 18:52
Excuse me, a little errata:

> After the current object a is checked against the object b, if `all_orderables` is true [...]

must be change to

> After the current object a is checked against the object b, ***if the check returns false and*** if `all_orderables` is true [...]
msg358442 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-12-15 19:43
> IMHO sorting functions should emit a warning if they contains an unorderable objects

How do you define "unorderable", and how do you propose to detect "unorderable objects" efficiently in practice within the sorting algorithm?

What you propose isn't practical in full generality. The case in which `sorted` can't give a well-defined result is the case where the collection of elements being sorted, under the given ordering, is not a total preorder. Being a total preorder is a property of the whole collection being sorted, not of individual objects: it's not necessarily true that if the collection is not totally preordered then there's any one item you can point to as being "unorderable".

Checking for a total preorder is much more expensive than simply sorting (at best it would be a quadratic time post-sort check), so that's not a reasonable check to make.

So by checking for unorderable objects, whatever those are, you're only solving a small part of the overall problem, at the expense of extra computation. You're also likely breaking the existing guarantees (depending on how you do this checking) that `sort` and `sorted` only rely on `<` comparisons; that would be a backwards compatibility break.
msg358443 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-12-15 19:44
Tim, Raymond: I propose that we close this issue as "won't fix". The suggestion to add `math.total_ordering` could be broken out into a separate issue.
msg358445 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-12-15 20:20
Closing as Mark suggested, but as "not a bug" rather than "won't fix".  That floats aren't totally ordered is a consequence of IEEE 754 semantics, not Python's whim.  As Mark said, if people want a total_ordering function for floats, that should be opened as a different issue - we're not going to change the default float comparison logic.
msg358447 - (view) Author: Marco Sulla (Marco Sulla) Date: 2019-12-15 21:36
Excuse me, but have you, Dickinson and Peters, read how I propose to check if the object is orderable or not? 

I explained it in a very detailed way, and this does not change the float comparison. And does not need to check first if the iterable it totally preordered.

Can you please read my post?
msg358450 - (view) Author: Marco Sulla (Marco Sulla) Date: 2019-12-15 21:55
Anyway, Java by default puts NaNs at the end of the iterable:

https://onlinegdb.com/SJjuiXE0S
msg358451 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-12-15 21:56
Marco, your

> I suppose the sorting function checks if the objects of
> the iterable are minor that another object

was incoherent to me.  No idea what "are minor that another object" could possibly mean.

As Mark explained, the mathematical meaning of "orderable" is more expensive to check than it is to do sorting.

Mark also explained that list.sort() guarantees to use only "<" (__lt__) comparisons.  Because your message was incoherent to me (see above), I don't know what purpose would be served by checking ">=" too, but if there _is_ a coherent purpose, list.sort() cannot use ">=" regardless.

Reply to Mark's message instead of this one?  You haven't addressed any of the points he raised, and they're all deal-breakers.
msg358456 - (view) Author: Marco Sulla (Marco Sulla) Date: 2019-12-15 22:19
> No idea what "are minor that another object" could possibly mean.

Oh my god... a < b?

> I don't know what purpose would be served by checking ">=" too

Well, it's very simple. Since the sorting algorithm checks if a < b, if this check fails, I propose to check also a >= b. If this is false too, the iterable contains an unorderable object. From this point, the check will never done again, an unorderable object is sufficient to raise the warning.

The check a >= b is *not* for ordering the iterable, is only for checking if the elements are orderable or not, and raise the warning.

Furthermore, I suppose that if someone is sure that its iterable is unorderable-free and want a fine-grained boost to speed, a flag can added. If true, sorting will not use the algorithm with the check, but the old algorithm.

> You haven't addressed any of the points he (Dickinson) raised

Dickinson said you have to check for total preorder. If you have understood my idea, this is not needed at all.
msg358462 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-12-16 02:23
Ah, so you're proposing a hack that would catch some, but not all, cases where a total ordering doesn't exist.  Why?  The issue tracker isn't a suitable place to discuss it regardless, so if you want to pursue it I'd suggest taking it to python-ideas.  If you do, some points to think about in advance:

- As above, it only catches some cases.  For example, given list [a, b], where a < b and b < a, no total ordering exists but the hack won't detect anything amiss.

- As has been said twice already, list.sort() promises to use only "<".  This isn't disposable.  User-written classes _rely_ on it, which was the point:  for instances to participate in sorting, the class only needs to define __lt__ (and, although it's not currently documented, they can also participate in the `bisect` and `heapq` module facilities, which also restrict themselves to "<" compares).

- Assuming "a < b" is false about half the time, the hack would require sorting to do about 50% more comparisons.  That's a huge expense.  All in all, I've spent at least a year of my life minimizing the number of compares Python's sort needs to do, and that would more than wipe out all the progress I've made since 1991 ;-)
msg358464 - (view) Author: Marco Sulla (Marco Sulla) Date: 2019-12-16 05:17
Excuse me, I had an epiphany.

NaN returns False for every comparison.

So in teory any element of the iterable should result minor that NaN.

So NaN should treated as the highest element, and should be at the end of the sorted result!

Indeed this is the behavior in Java. NaNs are in the end of the sorted iterator.

On the contrary, Python sorting does not move the NaN from its position.

Why?
msg358465 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-12-16 05:26
Sorry, but I'm not replying here any more:  the issue tracker is not for asking questions or brainstorming ideas.  Take this to python-ideas, please.  If a concrete proposal that gains traction results, _then_ the issue tracker may be an appropriate place (but a new issue - this issue is closed).
msg358834 - (view) Author: Marco Sulla (Marco Sulla) Date: 2019-12-23 20:51
marco@buzz:~$ python3.9
Python 3.9.0a0 (heads/master-dirty:d8ca2354ed, Oct 30 2019, 20:25:01) 
[GCC 9.2.1 20190909] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from decimal import Decimal as Dec, BasicContext as Bctx
>>> a = Dec("1981", Bctx)
>>> b = Dec("nan", Bctx)
>>> a.max(b)
Decimal('1981')
>>> b.max(a)
Decimal('1981')
>>> Bctx.max(a, b)
Decimal('1981')
>>> Bctx.max(b, a)
Decimal('1981')
msg358835 - (view) Author: Marco Sulla (Marco Sulla) Date: 2019-12-23 21:00
Excuse me, ignore my previous post.
History
Date User Action Args
2019-12-23 21:00:15Marco Sullasetmessages: + msg358835
2019-12-23 20:51:42Marco Sullasetmessages: + msg358834
2019-12-16 05:26:46tim.peterssetmessages: + msg358465
2019-12-16 05:17:23Marco Sullasetmessages: + msg358464
2019-12-16 02:23:07tim.peterssetmessages: + msg358462
2019-12-15 22:19:59Marco Sullasetmessages: + msg358456
2019-12-15 21:56:37tim.peterssetmessages: + msg358451
2019-12-15 21:55:04Marco Sullasetmessages: + msg358450
2019-12-15 21:36:44Marco Sullasetmessages: + msg358447
2019-12-15 20:20:51tim.peterssetstatus: open -> closed
versions: + Python 2.7, - Python 3.8
messages: + msg358445

resolution: not a bug
stage: patch review -> resolved
2019-12-15 19:44:35mark.dickinsonsetmessages: + msg358443
2019-12-15 19:43:12mark.dickinsonsetmessages: + msg358442
2019-12-15 18:52:34Marco Sullasetmessages: + msg358441
2019-12-15 18:46:53Marco Sullasetnosy: + Marco Sulla
messages: + msg358440
2019-02-24 22:58:29tim.peterssetmessages: + msg336487
2019-02-24 14:22:12brandtbuchersetmessages: + msg336472
2019-02-24 10:37:48mark.dickinsonsetmessages: + msg336457
2019-02-24 04:36:01tim.peterssetmessages: + msg336432
2019-02-24 02:55:59rhettingersetmessages: + msg336421
2019-02-23 23:45:28brandtbuchersetmessages: + msg336418
2019-02-23 22:42:07rhettingersetnosy: + mark.dickinson
messages: + msg336412
2019-02-23 21:49:42brandtbuchersetmessages: + msg336407
2019-02-23 21:14:28serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg336405
2019-02-23 19:58:39cheryl.sabellasetnosy: + cheryl.sabella
messages: + msg336402
2019-02-23 19:46:23xtreaksetnosy: + tim.peters, rhettinger
2019-02-23 19:39:36brandtbuchersetkeywords: + patch
stage: patch review
pull_requests: + pull_request12031
2019-02-23 19:39:17brandtbuchercreate