Classification
Title: statistics module - incorrect results with boolean input
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.4, Python 3.5
Process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: steven.daprano
Nosy List: della, mark.dickinson, r.david.murray, rhettinger, steven.daprano, wolma
Priority: normal
Keywords: patch

Created on 2015-04-28 08:53 by wolma, last changed 2018-04-08 20:03 by wolma. This issue is now closed.

Files
File name Uploaded
statistics._sum.patch wolma, 2015-04-28 08:56
statistics._sum.v2.patch wolma, 2015-05-02 20:05
Messages (6)
msg242169 - Author: Wolfgang Maier (wolma) Date: 2015-04-28 08:53
The mean function in the statistics module gives nonsensical results when the input contains boolean values, e.g.:

>>> from statistics import mean, variance
>>> mean([True, True, False, False])
0.25

>>> mean([True, 1027])
0.5

This is an issue with the module's internal _sum function that mean relies on. Other functions relying on _sum are affected more subtly, e.g.:

>>> variance([1, 1027, 0])
351234.3333333333

>>> variance([True, 1027, 0])
351234.3333333334

The problem with _sum is that it tries to coerce its result to any non-int type found in the input (bool in these examples), but bool(1028) is just True, so information is lost.
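
The information loss can be reproduced without the statistics module. The following is a minimal sketch, assuming the failure comes down to a final cast of an exact total to bool; `mean_with_final_cast` is a hypothetical helper written for illustration, not the module's `_sum` code.

```python
def mean_with_final_cast(values, cast):
    """Sum the values exactly as ints, cast the total, then divide by the count.

    Hypothetical helper for illustration; it is not the statistics module's code.
    """
    total = sum(int(v) for v in values)  # exact integer total
    return cast(total) / len(values)     # the cast may discard information


# Casting the total to bool collapses any non-zero sum to True (== 1).
lossy_small = mean_with_final_cast([True, True, False, False], bool)
lossy_large = mean_with_final_cast([True, 1027], bool)

# Keeping the total as an int preserves it.
exact_small = mean_with_final_cast([True, True, False, False], int)
exact_large = mean_with_final_cast([True, 1027], int)

print(lossy_small, lossy_large)  # 0.25 0.5
print(exact_small, exact_large)  # 0.5 514.0
```

The two lossy values match the results reported above, which is consistent with the total being reduced to True before the division.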

I've attached a patch preventing the type cast when it would be to bool.
I don't have time to write a separate test though so if somebody wants to take over .. :)
msg242362 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2015-05-02 01:00
I wonder if it would be better to reject Bool data in this context? Bool is only a numeric type for historical reasons.
msg242370 - Author: Steven D'Aprano (steven.daprano) (Python committer) Date: 2015-05-02 02:20
The patch seems simple and straightforward enough. It just needs some tests, and a Round Tuit.
msg242428 - Author: Wolfgang Maier (wolma) Date: 2015-05-02 20:05
Uploading an alternate, possibly slightly clearer version of the patch.
msg242451 - Author: Mark Dickinson (mark.dickinson) (Python committer) Date: 2015-05-03 06:09
> I wonder if it would be better to reject Bool data in this context?

It's not uncommon (and quite useful) in the NumPy world to compute basic statistics on arrays of boolean dtype: the sum of such an array gives a count of the `True`s, and the mean gives the proportion of `True` entries. I think it would be handy to allow the statistics module to work with lists of bools, if possible.
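
The counting idiom described above also works on plain Python lists, because bool is a subclass of int. The following is a small sketch using only the standard library rather than NumPy; the `flags` list is made-up sample data.

```python
from statistics import mean

flags = [True, True, False, True]  # made-up sample data

count_true = sum(flags)                    # True counts as 1, False as 0
proportion_true = count_true / len(flags)  # fraction of True entries

print(count_true)       # 3
print(proportion_true)  # 0.75

# On a Python version that includes the fix, mean() should agree
# with the proportion computed by hand.
print(mean(flags) == proportion_true)
```

On the affected versions (3.4 and 3.5 as reported here), the last comparison would be expected to fail, since mean would cast the total to bool first.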
msg315095 - Author: Wolfgang Maier (wolma) Date: 2018-04-08 20:03
Fixed as part of resolving issue 25177.
History
Date User Action Args
2018-04-08 20:03:14 wolma set status: open -> closed; resolution: fixed; messages: + msg315095; stage: test needed -> resolved
2016-05-02 21:41:14 r.david.murray link issue26913 superseder
2015-05-20 12:48:52 della set nosy: + della
2015-05-11 06:25:01 rhettinger set nosy: + rhettinger
2015-05-03 06:09:27 mark.dickinson set nosy: + mark.dickinson; messages: + msg242451
2015-05-02 20:05:30 wolma set files: + statistics._sum.v2.patch; messages: + msg242428
2015-05-02 02:20:49 steven.daprano set stage: test needed
2015-05-02 02:20:27 steven.daprano set assignee: steven.daprano; messages: + msg242370
2015-05-02 01:00:15 r.david.murray set nosy: + r.david.murray; messages: + msg242362
2015-04-28 08:56:15 wolma set files: + statistics._sum.patch
2015-04-28 08:54:54 wolma set files: - statistics._sum.patch
2015-04-28 08:53:26 wolma create