classification
Title: [statistics] Division by 2 in statistics.median
Type: behavior Stage: resolved
Components: Versions:
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: steven.daprano Nosy List: jfine2358, josh.r, remi.lapeyre, rhettinger, steven.daprano, vstinner
Priority: normal Keywords:

Created on 2019-01-09 11:46 by jfine2358, last changed 2019-01-14 13:26 by vstinner. This issue is now closed.

Messages (15)
msg333305 - (view) Author: Jonathan Fine (jfine2358) * Date: 2019-01-09 11:46
When len(data) is even, median returns the average of the two middle values. This average is computed using
        i = n//2
        return (data[i - 1] + data[i])/2

This results in the following behaviour

>>> from fractions import Fraction
>>> from statistics import median
>>> F1 = Fraction(1, 1)

>>> median([1])
1
>>> median([1, 1]) # Example 1.
1.0

>>> median([F1])
Fraction(1, 1)
>>> median([F1, F1])
Fraction(1, 1)

>>> median([2, 2, 1, F1]) # Example 2.
Fraction(3, 2)

>>> median([2, 2, F1, 1]) # Example 3.
1.5

Perhaps, when len(data) is even, it would be better to test the two middle values for equality and, if they are equal, return one of them without dividing. This would resolve Example 1. It would not help with Examples 2 and 3, which might not have a satisfactory solution.

See also issue 33084.
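
A minimal sketch of the equality test suggested above, for illustration only. The name median_equal_aware is hypothetical and is not part of the statistics module; apart from the equality check, the even-length branch mirrors the snippet quoted at the top of this message.

```python
from fractions import Fraction


def median_equal_aware(data):
    """Hypothetical variant of median that skips the division
    when the two middle values compare equal."""
    data = sorted(data)
    n = len(data)
    if n == 0:
        raise ValueError("no median for empty data")
    if n % 2 == 1:
        # Odd length: return the middle value unchanged.
        return data[n // 2]
    i = n // 2
    lo, hi = data[i - 1], data[i]
    if lo == hi:
        # Suggested shortcut: no averaging, so no change of type.
        return lo
    return (lo + hi) / 2


print(repr(median_equal_aware([1, 1])))                     # Example 1: 1
print(repr(median_equal_aware([2, 2, 1, Fraction(1, 1)])))  # Example 2: Fraction(3, 2)
print(repr(median_equal_aware([2, 2, Fraction(1, 1), 1])))  # Example 3: 1.5
```

Examples 2 and 3 still differ because sorted() is stable, so which of 1 and Fraction(1, 1) ends up in the middle pair depends on the input order.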
msg333309 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-01-09 11:58
> When len(data) is even, median returns the average of the two middle values.

I'm not sure that I understand your issue. Do you consider that it's a bug? It's part of the definition of the median function, no?
https://en.wikipedia.org/wiki/Median#Finite_set_of_numbers
msg333327 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2019-01-09 15:44
What do you think median([1, 1.0]) should return?
msg333340 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2019-01-09 18:05
vstinner: The problem isn't the averaging, it's the type inconsistency. In both examples (median([1]), median([1, 1])), the median is unambiguously 1 (no actual average is needed; the values are identical), yet it gets converted to 1.0 only in the latter case.

I'm not sure it's possible to fix this though; right now, there is consistency among two cases:

1. When the length is odd, you get the median by identity (and therefore type and value are unchanged)
2. When the length is even, you get the median by adding and dividing by 2 (so for ints, the result is always float).

A fix that changed that would add yet another layer of complexity:

1. When the length is odd, you get the median by identity (and therefore type and value are unchanged)
2. When the length is even, 
  a. If the two middle values are equal (possibly only if they have equal types as well, to resolve the issue with [1, 1.0] or [1, True]), return the first of the two middle values (median by identity as in #1)
  b. Otherwise, you get the median by adding and dividing by 2

Note the type checking needed in 2a to make even that much consistent. Even if we accepted that, we'd pretty quickly get into a debate over whether median([3, 5]) should try to return 4 instead of 4.0, given that the median is representable in the source type (which would further damage consistency).
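
To illustrate why case 2a would need a type check, here is a rough sketch. The helper even_median and its same_type_only flag are hypothetical names invented for this example, and the helper assumes a non-empty, even-length input.

```python
def even_median(data, same_type_only=False):
    """Hypothetical helper: median of an even-length sequence that returns
    the first middle value when the two middle values are equal."""
    data = sorted(data)
    i = len(data) // 2
    lo, hi = data[i - 1], data[i]
    equal = lo == hi
    if same_type_only:
        # Case 2a with the stricter rule: the types must match as well.
        equal = equal and type(lo) is type(hi)
    if equal:
        return lo
    return (lo + hi) / 2


# Without the type check, the result depends on the input order,
# because sorted() is stable and 1 == 1.0 == True.
print(repr(even_median([1, 1.0])))   # 1
print(repr(even_median([1.0, 1])))   # 1.0
print(repr(even_median([True, 1])))  # True

# With the type check, mixed-type pairs fall back to the division.
print(repr(even_median([1, 1.0], same_type_only=True)))   # 1.0
print(repr(even_median([True, 1], same_type_only=True)))  # 1.0
```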

If anything, I think the best design would have been to *always* include a division step (so odd length cases performed middle_elem / 1, while even did (middle_elem1 + middle_elem2) / 2) so the behavior was consistent regardless of odd vs. even input length, but that ship has probably sailed, given the documented behavior specifically notes that the precise middle data point is itself returned for the odd case.

I think the solution for people concerned is to explicitly convert int values to be median-ed to fractions.Fraction (or decimal.Decimal) ahead of time, so floating point math never gets involved, and the return type is consistent regardless of length.
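
A short sketch of that workaround, using only fractions.Fraction and statistics.median. The helper name exact_median is made up for this example.

```python
from fractions import Fraction
from statistics import median


def exact_median(values):
    """Hypothetical helper: convert to Fraction first so that the
    even-length average never goes through float arithmetic."""
    return median([Fraction(v) for v in values])


odd_case = exact_median([1])
even_equal = exact_median([1, 1])
even_distinct = exact_median([1, 2])

print(repr(odd_case))       # Fraction(1, 1)
print(repr(even_equal))     # Fraction(1, 1)
print(repr(even_distinct))  # Fraction(3, 2)
```

The return type is Fraction in all three cases, regardless of whether the input length is odd or even.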
msg333349 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-01-09 22:15
> vstinner: The problem isn't the averaging, it's the type inconsistency.

>>> type(statistics.median([1]))
<class 'int'>
>>> type(statistics.median([1,2]))
<class 'float'>

Which consistency? :-)
msg333374 - (view) Author: Jonathan Fine (jfine2358) * Date: 2019-01-10 12:18
I read PEP 450 as saying that statistics.py can be used by "any secondary school student". This is not true for most Python libraries.

In this context, the difference between a float and an int is important. Consider
   statistics.median([2] * n)

As a secondary school student, knowing the definition of median, I might expect the value to be 2, for any n > 0. What else could it be? However, the present code gives 2 for n odd, and 2.0 for n even.

I think that this issue is best approached by taking the point of view of a secondary school student. Or perhaps even a primary school student who knows fractions. (A teacher might use statistics.py to create learning materials.)

By the way, 2 and 2.0 are not interchangeable. For example
>>> [1] * 2.0
TypeError: can't multiply sequence by non-int of type 'float'
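
The alternation between int and float described above can be checked with a small loop. This is only an illustration; the variable name results is arbitrary.

```python
from statistics import median

# Collect (n, value, type name) for the first few list lengths.
results = []
for n in range(1, 5):
    m = median([2] * n)
    results.append((n, m, type(m).__name__))
    print(n, repr(m), type(m).__name__)
```

For n = 1 and n = 3 the result is the int 2; for n = 2 and n = 4 it is the float 2.0.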
msg333385 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2019-01-10 15:30
> As a secondary school student, knowing the definition of median, I might expect the value to be 2, for any n > 0.

The secondary school student would be wrong, wouldn't he?

The median of a set is not expected to be a member of the set, especially for ints, since true division by 1 or 2 does not return an int.

Would the same student expect median([2, 4, 6, 8]) to be part of the set of even integers?

I think anyone taking the median of a set should always be ready to deal with floating point arithmetic, since the result is not guaranteed to be an integer. Jumping through hoops to return an int whenever the result happens to equal an integer is rather misleading.
msg333386 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-01-10 15:33
I suggest closing the issue as "not a bug". IMHO statistics.median() respects the definition of the mathematical median function.
msg333398 - (view) Author: Jonathan Fine (jfine2358) * Date: 2019-01-10 16:29
Here's the essence of a patch.

Suppose the input is Python integers, and the output is a mathematical integer. In this case we can make the output a Python integer by using the helper function

>>> def wibble(p, q):
...     if type(p) == type(q) == int and p%q == 0:
...         return p // q
...     else:
...         return p / q
... 
>>> wibble(4, 2)
2
>>> wibble(3, 2)
1.5

This will also work for average.
msg333400 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2019-01-10 16:44
This does not do what you want:

    >>> class MyInt(int): pass
    >>> wibble(MyInt(4), MyInt(2))
    2.0

and a patch is only needed if something is broken.

I agree with vstinner that nothing is broken, and I vote to close this issue.
msg333410 - (view) Author: Jonathan Fine (jfine2358) * Date: 2019-01-10 17:14
It might be better in my sample code to write
    isinstance(p, int)
instead of
    type(p) == int
This would fix Rémi's example. (I wanted to avoid thinking about (False // True).)
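
A sketch of the helper with that change applied, assuming both arguments are checked with isinstance. MyInt is the subclass from Rémi's example.

```python
def wibble(p, q):
    # isinstance() also accepts int subclasses, including bool.
    if isinstance(p, int) and isinstance(q, int) and p % q == 0:
        return p // q
    return p / q


class MyInt(int):
    pass


print(repr(wibble(MyInt(4), MyInt(2))))  # 2
print(repr(wibble(3, 2)))                # 1.5
print(repr(wibble(False, True)))         # 0
```

Since bool is a subclass of int, wibble(False, True) takes the floor-division branch and returns the int 0.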

For median([1, 1]), I am not claiming that 1.0 is wrong and 1 is right. I'm not saying the module is broken, only that it can be improved.

For median([1, 1]), I believe that 1 is a better answer, particularly for school students. In other words, that making this change would improve Python.

As a pure mathematician, to me 1.0 means a number that is close to 1. Whereas 1 means a number that is exactly 1.
msg333539 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-01-13 04:12
> As a pure mathematician, to me 1.0 means a number that is 
> close to 1. Whereas 1 means a number that is exactly 1.

Descriptive statistics performed on a computer using actual measurements is pretty far from "pure mathematics" ;-)

Making this change is likely pointless for most users and likely confusing for others (i.e., why does the type switch between median([1, 1]) and median([1, 3])?).

I concur with Victor and recommend closing.
msg333618 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-01-14 13:08
I agree that for numeric data, it isn't worth changing the behaviour of median to avoid the division in the case of two equal middle values.

Even if we did accept this feature request, it is not going to eliminate the change in type in all circumstances. median([1, 2]) will still return 1.5. And in practical terms, the conditions where this would apply are likely to be quite unusual for numeric data. (Ordinal data is likely to be a different story.)

One way or another, the caller has to expect that the median of an even number of ints may return a number which is a float. If the caller doesn't want that behaviour, they can use median_low or median_high, which never take the average and always return a value from the data set.
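
For comparison, a brief illustration of the three functions on even-length int data. All three are existing functions in the statistics module; the variable names are arbitrary.

```python
from statistics import median, median_high, median_low

equal_middle = [1, 1]
distinct_middle = [1, 3]

# median averages the two middle values, so ints become a float.
print(repr(median(equal_middle)), repr(median(distinct_middle)))
# median_low and median_high pick one of the two middle values instead.
print(repr(median_low(equal_middle)), repr(median_low(distinct_middle)))
print(repr(median_high(equal_middle)), repr(median_high(distinct_middle)))
```

Here median returns 1.0 and 2.0, median_low returns 1 and 1, and median_high returns 1 and 3.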
msg333619 - (view) Author: Jonathan Fine (jfine2358) * Date: 2019-01-14 13:20
I'm still thinking about this.

I find Steve's closing of the issue premature, but I'm not going to reverse it.
msg333620 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-01-14 13:26
> I find Steve's closing of the issue premature, but I'm not going to reverse it.

Steven D'Aprano is the maintainer of the module (he wrote PEP 450 and implemented it), so he has the last word.

Steven D'Aprano, Raymond Hettinger and I are three core developers, and all of us are in favor of closing the issue.
History
Date                 User            Action  Args
2019-01-14 13:26:42  vstinner        set     messages: + msg333620
2019-01-14 13:20:16  jfine2358       set     messages: + msg333619
2019-01-14 13:08:20  steven.daprano  set     status: open -> closed; resolution: not a bug; messages: + msg333618; stage: resolved
2019-01-13 04:12:59  rhettinger      set     assignee: steven.daprano; messages: + msg333539; nosy: + rhettinger
2019-01-10 17:58:55  brett.cannon    set     title: Division by 2 in statistics.median -> [statistics] Division by 2 in statistics.median
2019-01-10 17:58:32  brett.cannon    set     nosy: + steven.daprano
2019-01-10 17:14:59  jfine2358       set     messages: + msg333410
2019-01-10 16:44:08  remi.lapeyre    set     messages: + msg333400
2019-01-10 16:29:18  jfine2358       set     messages: + msg333398
2019-01-10 15:33:22  vstinner        set     messages: + msg333386
2019-01-10 15:30:27  remi.lapeyre    set     messages: + msg333385
2019-01-10 12:18:48  jfine2358       set     messages: + msg333374
2019-01-09 22:15:23  vstinner        set     messages: + msg333349
2019-01-09 18:05:04  josh.r          set     nosy: + josh.r; messages: + msg333340
2019-01-09 15:44:15  remi.lapeyre    set     nosy: + remi.lapeyre; messages: + msg333327
2019-01-09 11:58:33  vstinner        set     nosy: + vstinner; messages: + msg333309
2019-01-09 11:46:12  jfine2358       create