Issue35698

Created on **2019-01-09 11:46** by **jfine2358**, last changed **2019-01-14 13:26** by **vstinner**. This issue is now **closed**.

Messages (15) | |||
---|---|---|---|

msg333305 - (view) | Author: Jonathan Fine (jfine2358) * | Date: 2019-01-09 11:46 | |

When len(data) is even, median returns the average of the two middle values. This average is computed using

    i = n//2
    return (data[i - 1] + data[i])/2

This results in the following behaviour

    >>> from fractions import Fraction
    >>> from statistics import median
    >>> F1 = Fraction(1, 1)
    >>> median([1])
    1
    >>> median([1, 1]) # Example 1.
    1.0
    >>> median([F1])
    Fraction(1, 1)
    >>> median([F1, F1])
    Fraction(1, 1)
    >>> median([2, 2, 1, F1]) # Example 2.
    Fraction(3, 2)
    >>> median([2, 2, F1, 1]) # Example 3.
    1.5

Perhaps, when len(data) is even, it would be better to test the two middle values for equality. This would resolve Example 1. It would not help with Examples 2 and 3, which might not have a satisfactory solution.

See also issue 33084. |
|||

msg333309 - (view) | Author: STINNER Victor (vstinner) * | Date: 2019-01-09 11:58 | |

> When len(data) is even, median returns the average of the two middle values.

I'm not sure that I understand your issue. Do you consider that it's a bug? It's part of the definition of the median function, no? https://en.wikipedia.org/wiki/Median#Finite_set_of_numbers |
|||

msg333327 - (view) | Author: Rémi Lapeyre (remi.lapeyre) * | Date: 2019-01-09 15:44 | |

What do you think median([1, 1.0]) should return? |
|||

msg333340 - (view) | Author: Josh Rosenberg (josh.r) * | Date: 2019-01-09 18:05 | |

vstinner: The problem isn't the averaging, it's the type inconsistency. In both examples (median([1]), median([1, 1])), the median is unambiguously 1 (no actual average is needed; the values are identical), yet it gets converted to 1.0 only in the latter case.

I'm not sure it's possible to fix this though; right now, there is consistency among two cases:

1. When the length is odd, you get the median by identity (and therefore type and value are unchanged)
2. When the length is even, you get the median by adding and dividing by 2 (so for ints, the result is always float).

A fix that changed that would add yet another layer of complexity:

1. When the length is odd, you get the median by identity (and therefore type and value are unchanged)
2. When the length is even,
   a. If the two middle values are equal (possibly only if they have equal types as well, to resolve the issue with [1, 1.0] or [1, True]), return the first of the two middle values (median by identity as in #1)
   b. Otherwise, you get the median by adding and dividing by 2

And note the type checking required in 2a to even make it that consistent. Even if we accepted that, we'd pretty quickly get into a debate over whether median([3, 5]) should try to return 4 instead of 4.0, given that the median is representable in the source type (which would further damage consistency).

If anything, I think the best design would have been to *always* include a division step (so odd length cases performed middle_elem / 1, while even did (middle_elem1 + middle_elem2) / 2) so the behavior was consistent regardless of odd vs. even input length, but that ship has probably sailed, given the documented behavior specifically notes that the precise middle data point is itself returned for the odd case.

I think the solution for people concerned is to explicitly convert int values to be median-ed to fractions.Fraction (or decimal.Decimal) ahead of time, so floating point math never gets involved, and the return type is consistent regardless of length. |
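The layered rule and the Fraction workaround described in this message can be sketched roughly as follows. The function name `median_identity_when_equal` is hypothetical and is not part of the statistics module; it only illustrates rule 2a (return a middle value unchanged when both middle values are equal and of the same type).

```python
from fractions import Fraction
from statistics import median


def median_identity_when_equal(data):
    """Hypothetical median variant; NOT how statistics.median behaves."""
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise ValueError("no median for empty data")
    if n % 2 == 1:
        # Case 1: odd length, return the middle element unchanged.
        return values[n // 2]
    lo, hi = values[n // 2 - 1], values[n // 2]
    if type(lo) is type(hi) and lo == hi:
        # Case 2a: equal values of the same type, no division needed.
        return lo
    # Case 2b: fall back to averaging, as statistics.median does.
    return (lo + hi) / 2


print(repr(median_identity_when_equal([1, 1])))    # int result
print(repr(median_identity_when_equal([1, 1.0])))  # mixed types, averaged
print(repr(median_identity_when_equal([3, 5])))    # still a float

# The workaround: convert to Fraction first, so the standard median()
# returns a Fraction for both odd and even lengths.
print(repr(median([Fraction(x) for x in [1, 1]])))
print(repr(median([Fraction(x) for x in [3, 5]])))
```

Even with rule 2a in place, `[3, 5]` still produces a float, which is the remaining inconsistency the message points out; the Fraction conversion avoids the question by keeping every result in one type.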
|||

msg333349 - (view) | Author: STINNER Victor (vstinner) * | Date: 2019-01-09 22:15 | |

> vstinner: The problem isn't the averaging, it's the type inconsistency.

    >>> type(statistics.median([1]))
    <class 'int'>
    >>> type(statistics.median([1,2]))
    <class 'float'>

Which consistency? :-) |
|||

msg333374 - (view) | Author: Jonathan Fine (jfine2358) * | Date: 2019-01-10 12:18 | |

I read PEP 450 as saying that statistics.py can be used by "any secondary school student". This is not true for most Python libraries. In this context, the difference between a float and an int is important. Consider

    statistics.median([2] * n)

As a secondary school student, knowing the definition of median, I might expect the value to be 2, for any n > 0. What else could it be? However, the present code gives 2 for n odd, and 2.0 for n even.

I think that this issue is best approached by taking the point of view of a secondary school student. Or perhaps even a primary school student who knows fractions. (A teacher might use statistics.py to create learning materials.)

By the way, 2 and 2.0 are not interchangeable. For example

    >>> [1] * 2.0
    TypeError: can't multiply sequence by non-int of type 'float'
 |
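The alternation described in this message can be reproduced with a short loop. This is only an illustration of the current behaviour of `statistics.median` on a list of identical ints; the variable name `results` is chosen here for the example.

```python
from statistics import median

# Median of [2] * n for n = 1..4: the value is always 2,
# but the type depends on whether n is odd or even.
results = [median([2] * n) for n in range(1, 5)]

for n, value in enumerate(results, start=1):
    print(n, repr(value), type(value).__name__)
# 1 2 int
# 2 2.0 float
# 3 2 int
# 4 2.0 float
```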
|||

msg333385 - (view) | Author: Rémi Lapeyre (remi.lapeyre) * | Date: 2019-01-10 15:30 | |

> As a secondary school student, knowing the definition of median, I might expect the value to be 2, for any n > 0.

The secondary school student would be wrong, wouldn't he? The median of a set is not expected to be a part of the set. Especially for ints, since the integers are not closed under division by 2. Would the same student expect median([2, 4, 6, 8]) to be part of the set of even integers?

I think anyone taking the median of a set should always be ready to deal with floating point arithmetic, since the result is not guaranteed to be an integer. Jumping through hoops to return an integer when the result happens to equal one is rather misleading. |
|||

msg333386 - (view) | Author: STINNER Victor (vstinner) * | Date: 2019-01-10 15:33 | |

I suggest closing the issue as "not a bug". IMHO statistics.median() respects the definition of the mathematical median function. |
|||

msg333398 - (view) | Author: Jonathan Fine (jfine2358) * | Date: 2019-01-10 16:29 | |

Here's the essence of a patch. Suppose the input is Python integers, and the output is a mathematical integer. In this case we can make the output a Python integer by using the helper function

    >>> def wibble(p, q):
    ...     if type(p) == type(q) == int and p%q == 0:
    ...         return p // q
    ...     else:
    ...         return p / q
    ...
    >>> wibble(4, 2)
    2
    >>> wibble(3, 2)
    1.5

This will also work for average. |
|||

msg333400 - (view) | Author: Rémi Lapeyre (remi.lapeyre) * | Date: 2019-01-10 16:44 | |

This does not do what you want:

    >>> class MyInt(int): pass
    >>> wibble(MyInt(4), MyInt(2))
    2.0

and a patch is only needed if something is broken. I share vstinner's opinion that nothing is broken and vote to close this issue. |
|||

msg333410 - (view) | Author: Jonathan Fine (jfine2358) * | Date: 2019-01-10 17:14 | |

It might be better in my sample code to write

    isinstance(p, int)

instead of

    type(p) == int

This would fix Rémi's example. (I wanted to avoid thinking about (False // True).)

For median([1, 1]), I am not claiming that 1.0 is wrong and 1 is right. I'm not saying the module is broken, only that it can be improved. For median([1, 1]), I believe that 1 is a better answer, particularly for school students. In other words, that making this change would improve Python.

As a pure mathematician, to me 1.0 means a number that is close to 1. Whereas 1 means a number that is exactly 1. |
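A minimal sketch of the helper with the `isinstance` change applied is shown here, assuming the check is applied to both arguments. `wibble` and `MyInt` are the names used in the thread; neither exists in the standard library, and nothing like this was added to the statistics module.

```python
def wibble(p, q):
    """Divide p by q, staying in int when the division is exact."""
    # isinstance() also accepts int subclasses, including bool.
    if isinstance(p, int) and isinstance(q, int) and p % q == 0:
        return p // q
    return p / q


class MyInt(int):
    pass


print(repr(wibble(4, 2)))                # 2
print(repr(wibble(3, 2)))                # 1.5
print(repr(wibble(MyInt(4), MyInt(2))))  # 2, a plain int
print(repr(wibble(True, True)))          # 1, bools are treated as ints
```

Because `bool` is a subclass of `int`, this version also sends `True` and `False` down the integer path, which is the case the message says it wanted to avoid thinking about.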
|||

msg333539 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2019-01-13 04:12 | |

> As a pure mathematician, to me 1.0 means a number that is
> close to 1. Whereas 1 means a number that is exactly 1.

Descriptive statistics performed on a computer using actual measurements is pretty far from "pure mathematics" ;-)

Making this change is likely pointless for most users and likely confusing for others (i.e. why does the type switch between median([1, 1]) and median([1, 3])?).

I concur with Victor and recommend closing. |
|||

msg333618 - (view) | Author: Steven D'Aprano (steven.daprano) * | Date: 2019-01-14 13:08 | |

I agree that for numeric data, it isn't worth changing the behaviour of median to avoid the division in the case of two equal middle values. Even if we did accept this feature request, it is not going to eliminate the change in type in all circumstances. median([1, 2]) will still return 1.5. And in practical terms, the conditions where this would apply are likely to be quite unusual for numeric data. (Ordinal data is likely to be a different story.) One way or another, the caller has to expect that the median of an even number of ints may return a number which is a float. If the caller doesn't want that behaviour, they can use median_low or median_high, which never take the average and always return a value from the data set. |
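The alternative mentioned in this message can be illustrated as below. `median_low` and `median_high` are existing functions in the statistics module; the sample list `data` is made up for the example.

```python
from statistics import median, median_high, median_low

data = [1, 3, 5, 7]  # even number of ints

print(repr(median(data)))       # 4.0, the average of 3 and 5
print(repr(median_low(data)))   # 3, the lower middle value
print(repr(median_high(data)))  # 5, the higher middle value

# With two equal middle values, both variants return the int unchanged.
print(repr(median_low([1, 1])), repr(median_high([1, 1])))  # 1 1
```

Both variants always return an element of the data, so the result keeps the element's type regardless of whether the length is odd or even.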
|||

msg333619 - (view) | Author: Jonathan Fine (jfine2358) * | Date: 2019-01-14 13:20 | |

I'm still thinking about this. I find Steve's closing of the issue premature, but I'm not going to reverse it. |
|||

msg333620 - (view) | Author: STINNER Victor (vstinner) * | Date: 2019-01-14 13:26 | |

> I find Steve's closing of the issue premature, but I'm not going to reverse it.

Steven D'Aprano is the maintainer of the module (he wrote PEP 450 and implemented it), so he has the last word. Steven D'Aprano, Raymond Hettinger and I are three core developers in favor of closing the issue. |

History | |||
---|---|---|---|
Date | User | Action | Args |
2019-01-14 13:26:42 | vstinner | set | messages: + msg333620 |
2019-01-14 13:20:16 | jfine2358 | set | messages: + msg333619 |
2019-01-14 13:08:20 | steven.daprano | set | status: open -> closed; resolution: not a bug; messages: + msg333618; stage: resolved |
2019-01-13 04:12:59 | rhettinger | set | assignee: steven.daprano; messages: + msg333539; nosy: + rhettinger |
2019-01-10 17:58:55 | brett.cannon | set | title: Division by 2 in statistics.median -> [statistics] Division by 2 in statistics.median |
2019-01-10 17:58:32 | brett.cannon | set | nosy: + steven.daprano |
2019-01-10 17:14:59 | jfine2358 | set | messages: + msg333410 |
2019-01-10 16:44:08 | remi.lapeyre | set | messages: + msg333400 |
2019-01-10 16:29:18 | jfine2358 | set | messages: + msg333398 |
2019-01-10 15:33:22 | vstinner | set | messages: + msg333386 |
2019-01-10 15:30:27 | remi.lapeyre | set | messages: + msg333385 |
2019-01-10 12:18:48 | jfine2358 | set | messages: + msg333374 |
2019-01-09 22:15:23 | vstinner | set | messages: + msg333349 |
2019-01-09 18:05:04 | josh.r | set | nosy: + josh.r; messages: + msg333340 |
2019-01-09 15:44:15 | remi.lapeyre | set | nosy: + remi.lapeyre; messages: + msg333327 |
2019-01-09 11:58:33 | vstinner | set | nosy: + vstinner; messages: + msg333309 |
2019-01-09 11:46:12 | jfine2358 | create |