Issue 20481: Clarify type coercion rules in statistics module

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64680

classification

Title:	Clarify type coercion rules in statistics module
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.4

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	ncoghlan	Nosy List:	gregory.p.smith, larry, ncoghlan, oscarbenjamin, python-dev, steven.daprano, wolma
Priority:	release blocker	Keywords:	patch

Created on 2014-02-02 01:31 by ncoghlan, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
_sum_coerce.py	wolma, 2014-02-02 07:45	provides _sum and _coerce_types to replace the same functions in statistics.py, requires import numbers to work
statistics.patch	wolma, 2014-02-04 16:45		review
stats.patch	steven.daprano, 2014-02-07 21:04		review

Messages (21)
msg209934 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-02-02 01:31
I haven't completely following the type coercion discussion on python-ideas. but the statistics module at least needs a docs clarification (to explain that the current behaviour when mixing input types is not fully defined, especially when Decimal is involved), and potentially a behavioural change to disallow certain type combinations where the behaviour may change in the future (see https://mail.python.org/pipermail/python-ideas/2014-February/025214.html for example) Either option seems reasonable to me (with a slight preference for the latter), but it's at least clear that we need to avoid locking ourselves into the exact coercion behaviour of the current implementation indefinitely.
msg209956 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-02-02 07:45
Thanks Nick for filing this! I've been working on modifications to statistics._sum and statistics._coerce_types that together make the module's behaviour independent of the order of input types (by making the decision based on the set of input types) and, specifically, disallow certain combinations involving Decimal reliably. Specifically, my modified version, like the original, returns the input type if there is only one; for mixed input types, differently than the original, it tries to return the most basic representative of the most narrow number type from the numbers tower (i.e., int if all input types are Integral, Fraction if all are Rational, float with all Real); Decimal is treated in the following way: if Decimal is the only input type or if the input types consist of only Decimal and Integral, Decimal is returned; my two versions of _coerce_types differ in that the first raises TypeError with any other combination involving Decimal, the second allows combinations of Decimal with float (returning float) and raises TypeError with all others. I sent the first version to Steven D'Aprano and Oscar Benjamin for review.
msg209976 - (view)	Author: Oscar Benjamin (oscarbenjamin) *	Date: 2014-02-02 12:33
Wolfgang have you tested this with any third party numeric types from sympy, gmpy2, mpmath etc.? Last I checked no third party types implement the numbers ABCs e.g.: >>> import sympy, numbers >>> r = sympy.Rational(1, 2) >>> r 1/2 >>> isinstance(r, numbers.Rational) False AFAICT testing against the numbers ABCs is just a slow way of testing against the stdlib types: $ python -m timeit -s 'from numbers import Integral' 'isinstance(1, Integral)' 100000 loops, best of 3: 2.59 usec per loop $ python -m timeit -s 'from numbers import Integral' 'isinstance(1, int)' 1000000 loops, best of 3: 0.31 usec per loop You can at least make it faster using a tuple: $ python -m timeit -s 'from numbers import Integral' 'isinstance(1, (int, Integral))' 1000000 loops, best of 3: 0.423 usec per loop I'm not saying that this is necessarily a worthwhile optimisation but rather that the numbers ABCs are in practice not really very useful (AFAICT). I don't know how well the statistics module currently handles third party numeric types but if the type coercion is to be changed then this something to look at. The current implementation tries to duck-type to some extent and yours uses ABCs but does either approach actually have any practical gain for interoperability with non-stdlib numeric types? If not then it would be simpler just to explicitly hard-code exactly how it works for the powerset of stdlib types. OTOH if it could be made to do sensible things with non-stdlib types then that would be great. Falling back on float isn't a bad choice but if it can be made to do exact things for exact types (e.g. sympy.Rational) then that would be great. Similarly mpmath.mpf provides multi-precision floats. It would be great to be able to take advantage of that higher precision rather than downgrade everything to float. This is in general a hard problem though so I don't think it's unreasonable to make restrictions about what types it can work with - achieving optimal accuracy for all types without restrictions is basically impossible.
msg210030 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-02-02 22:04
Hi Oscar, well, I haven't used sympy much, and I have no experience with the others, but in light of your comment I quickly checked sympy and gmpy2. You are right about them still not using the numbers ABCs, however, on your advise I also checked how the current statistics module implementation handles their numeric types and the answer is: it doesn't, and this is totally independent of the _coerce_types issue. For sympy: the problem lies with statistics._exact_ratio, which cannot convert sympy numeric types to a numerator/denominator tuple (a prerequisite for _sum) For gmpy2: the problem occurs just one step further down the road. gmpy2.Rationals have numerator and denominator properties, so _exact_ratio knows how to handle them, but the elements of the returned tuple are of type gmpy2.mpz (gmpy2's integer equivalent) and when _sum tries to convert the tuple into a Fraction you get: TypeError: both arguments should be Rational instances which is precisely because the mpz type is not integrated into the numbers tower. This last example is very illustrative I think because it shows that already now the standard library (the fractions module in this case) requires numeric types to comply with the numeric tower, so statistics would not be without precedent, and I think this is totally justified: after all this is the standard library (can't believe I'm saying this since I really got into this sort of by accident) and third party libraries should seek compatibility, but the standard library just needs to be self-consistent. I guess using ABCs over a duck-typing approach when coercing types, in fact, offers a huge advantage for third party libraries since they only need to register their types with the numbers ABC to achieve compatibility, while they need to consider complicated subclassing schemes with the current approach (of course, I am only talking about compatibility with _coerce_types here, which is the focus of this issue. Other parts of statistics may impose further restrictions as we've just seen for _sum). Finally, regarding speed. The fundamental difference between the current implementation and my proposed change is that the current version calls _coerce_types for every number in the input sequence, so performance is critical here, but in my version _coerce_types gets called only once and then operates on a really small set of input types, so it is absolutely not the time-critical step in the overall performance of _sum. For this very reason I made no effort at all to optimize the code, but just tried to keep it as simple and clear as possible. This, in fact, is IMHO the second major benefit of my proposal for _coerce_types (besides making its result order-independent). Read the current code for _coerce_types, then the proposed one. Try to consider all their ramifications and side-effects and decide which one's easier to understand and maintain. Best, Wolfgang
msg210113 - (view)	Author: Oscar Benjamin (oscarbenjamin) *	Date: 2014-02-03 11:11
It's not as simple as registering with an ABC. You also need to provide the interface that the ABC represents: >>> import sympy >>> r = sympy.Rational(1, 2) >>> r 1/2 >>> r.numerator Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'Half' object has no attribute 'numerator' AFAIK there are no plans by any third part libraries to increase their inter-operability with the numeric tower. My point is that in choosing what types to accept and how to coerce them you should focus on actual practical benefits rather than theoretical ones. If it can be done in a way that means it works for more numeric types then that's great. But when I say "works" I mean that it should ideally achieve the best possible accuracy for each type. If that's not possible then it might be simplest to just document how it works for combinations of the std lib types (and perhaps subclasses thereof) and then say that it will fall back on coercing to float for anything else. This approach is simpler to document and for end-users to understand. It also has the benefit that it will work for all non std lib types (that I'm aware of) without pretending to achieve more accuracy than it can. >>> import sympy, fractions, gmpy >>> fractions.Fraction(sympy.Rational(1, 2)) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/python2.7/fractions.py", line 148, in __new__ raise TypeError("argument should be a string " TypeError: argument should be a string or a Rational instance >>> float(sympy.Rational(1, 2)) 0.5 >>> fractions.Fraction(gmpy.mpq(1, 2)) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/python2.7/fractions.py", line 148, in __new__ raise TypeError("argument should be a string " TypeError: argument should be a string or a Rational instance >>> float(gmpy.mpq(1, 2)) 0.5 Coercion to float via __float__ is well supported in the Python ecosystem. Consistent support for getting exact integer ratios is (unfortunately) not.
msg210126 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-02-03 14:14
Just to make sure that this discussion is not getting on the wrong track, there are currently two strict requirements for any numeric type to be usable with statistics._sum: (1) the type has to provide either - numerator/denominator properties or - an as_integer_ratio method or - an as_tuple method that mimicks the Decimal method of that name this requirement comes from statistics._exact_ratio. This is where, for example, the sympy numeric types fail because they do not provide any of these interfaces. ************** For completeness, this is the code of this function: def _exact_ratio(x): """Convert Real number x exactly to (numerator, denominator) pair. >>> _exact_ratio(0.25) (1, 4) x is expected to be an int, Fraction, Decimal or float. """ try: try: # int, Fraction return (x.numerator, x.denominator) except AttributeError: # float try: return x.as_integer_ratio() except AttributeError: # Decimal try: return _decimal_to_ratio(x) except AttributeError: msg = "can't convert type '{}' to numerator/denominator" raise TypeError(msg.format(type(x).__name__)) from None except (OverflowError, ValueError): # INF or NAN if __debug__: # Decimal signalling NANs cannot be converted to float :-( if isinstance(x, Decimal): assert not x.is_finite() else: assert not math.isfinite(x) return (x, None) *************** (2) Essentially, the numerator and the denominator returned by _exact_ratio have to be valid arguments for the Fraction constructor. This is a consequence of this block of code in _sum: for d, n in sorted(partials.items()): total += Fraction(n, d) Of note, Fraction(n, d) requires both arguments to be members of numbers.Rational and this is where, for example, the gmpy.mpq type fails. (3) The type's constructor has to work with a Fraction argument. This is because _sum tries to return T(total) # where T is the coerced type and total the calculated sum as a Fraction The gmpy.mpq type, for example, fails at this step again. ACTUALLY: THIS SHOULD BE RAISED AS ITS OWN ISSUE HERE as soon as the coercion part is settled because it means that _sum may succeed with certain mixed input types even though some of the same types may fail, when they are the only type. IMPORTANTLY, neither requirement has anything to do with the module's type coercion, which is the topic of this discussion. IN ADDITION, the proposed patch involving _coerce_types adds the following (soft) requirement: in order to be able to coerce a sequence of numbers of mixed numeric types without loss of precision all involved types need to be integrated into the hierarchy of numeric abstract base classes as defined in the numbers module (otherwise all types are coerced to float). From this it should be clear that the compatibility bottleneck is not in this patch, but in other parts of the module. What is more, if a custom numeric type implements the numerator/denominator properties, then it is simple enough to register it as a virtual subclass of numbers.Rational or .Integral to enable lossless coercion in the presence of mixed types. Example with gmpy2 numeric type: >>> from gmpy2 import mpq # mpq is gmpy's Fraction equivalent >>> import numbers >>> numbers.Rational.register(type(mpq())) >>> r = mpq(7,5) >>> type(r) <class 'mpq'> >>> issubclass(type(r), numbers.Rational) True => the mpq type could now be coerced correctly even in mixed input types situations, but remains incompatible with _sum for reasons (2) and (3). SUMMARY: Making _sum work with custom types is very complicated, BUT: this is not due to type coercion as it is discussed here.
msg210127 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-02-03 14:18
> there are currently two strict requirements for any numeric type to be usable with statistics._sum: I meant three of course (remembered one only during writing).
msg210137 - (view)	Author: Oscar Benjamin (oscarbenjamin) *	Date: 2014-02-03 15:13
I agree that supporting non-stdlib types is in some ways a separate issue from how to manage coercion with mixed stdlib types. Can you provide a complete patch (e.g. hg diff > coerce_types.patch). http://docs.python.org/devguide/ There should probably also be tests added for situations where the current implementation behaves undesirably. Something like these ones: http://hg.python.org/cpython/file/a97ce3ecc96a/Lib/test/test_statistics.py#l1445 Note that when I said non-stdlib types can be handled by coercing to float I didn't mean that the output should be coerced to float but rather the input should be coerced to float because __float__ is the most consistent interface available on third party numeric types. Once the input numbers are converted to float statistics._sum can handle them perfectly well. In this case I think the output should also be a float so that it's clear that precision may have been lost. If the precision of float is not what the user wants then the documentation can point them toward Fraction/Decimal.
msg210143 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-02-03 15:47
> Once the input numbers are converted to float statistics._sum can handle > them perfectly well. In this case I think the output should also be a float so > that it's clear that precision may have been lost. If the precision of float is not > what the user wants then the documentation can point them toward > Fraction/Decimal. Ah, I'm getting it now. That is actually a very interesting thought. Still I don't think this should be part of this issue discussion, but I'll think about it and file a new enhancement issue if I have an idea. As for providing the complete patch, I can do that I guess, but formulating the tests may take a while. I'll try to do it though, if Steven thinks it's worth it. After all he's the one who'd have to approve it and the patch does alter his design quite a bit.
msg210202 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-02-04 11:15
I think it's also acceptable at this point for the module docs to just say that handling of mixed type input is undefined and implementation dependent, and recommend doing "map(int, input_data)", "map(float, input_data)", "map(Decimal, input_data)" or "map(Fraction, input_data)" to ensure getting a consistent answer for mixed type input. I believe it would also be acceptable for the module to just fail immediately as soon as it detects an input type that differs from the type of the first value rather than attempting to guess the most appropriate behaviour. This close to 3.4rc1 (Sunday 9th February), I don't think we want to be committing to any particular implementation of type coercion.
msg210236 - (view)	Author: Oscar Benjamin (oscarbenjamin) *	Date: 2014-02-04 14:46
I was working on the basis that we were talking about Python 3.5. But now I see that it's a 3.4 release blocker. Is it really that urgent? I think the current behaviour is very good at handling a wide range of types. It would be nice to consistently report errors for incompatible types but it can also just be documented as a thing that users shouldn't do. If there were a situation where it silently returned a highly inaccurate value I would consider that urgent but I don't think there is.
msg210239 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-02-04 15:11
Changing the behaviour is not urgent - documenting that it may change in the future is essential.
msg210243 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-02-04 15:43
Hi Nick and Oscar, my patch including tests is ready. What else should I say than that I think it is ok, but of course with only days remaining and still no feedback from Steven, I'm not sure there is any chance for it going into 3.4 still. Anyway, before uploading I wanted to ask you what the best procedure is: upload the module and its tests as one diff or separately so that it is easier to see where the current module version fails? Best, Wolfgang
msg210247 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2014-02-04 16:32
Wolfgang, Thanks for the patch, I have some concerns about it, but the basic idea does look reasonable. However, I've been convinced that supporting mixed types at all needs more careful thought. Under the circumstances, I'm more concerned about making sure we don't lock in any behaviour we'll regret, so for 3.4 I'm going to follow Nick's advice and reject mixed-type calculations.
msg210251 - (view)	Author: Wolfgang Maier (wolma) *	Date: 2014-02-04 16:45
Hi Steven, sounds reasonable, still here's the patch in diff version. Its rules for type coercion are detailed in _coerce_types's docstring. Questions and comments are welcome. Best, Wolfgang
msg210557 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2014-02-07 21:04
Attached is a patch which: - documents that mixed types are not currently supported; - changes the behaviour of _sum to raise TypeError on mixed input types (mixing int and <other> is allowed, but nothing else); - updates the tests; - adds some documentation changes for issue 20389.
msg210604 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-02-08 09:49
Claiming to commit before 3.4rc1 (as I believe Steven's SSH key still needs to be added)
msg210607 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-02-08 10:14
New changeset 5db74cd953ab by Nick Coghlan in branch 'default': Close #20481: Disallow mixed type input in statistics http://hg.python.org/cpython/rev/5db74cd953ab
msg210611 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-02-08 10:19
OK, I committed a slight variant of Steven's patch: - I moved the issue 20389 doc changes to a separate patch that I uploaded over there - Steven had marked one of the tests as potentially obsolete, I just removed it entirely. If we decide we eventually want it back, it's still there in the VCS history :)
msg210663 - (view)	Author: Oscar Benjamin (oscarbenjamin) *	Date: 2014-02-08 15:56
> Close #20481: Disallow mixed type input in statistics If I understand correctly the reason for hastily pushing this patch through is that it's the safe option: disallow mixing types as a quick fix for soon to be released 3.4. If we want to allow mixing types then that can be a new feature for 3.5. Is that correct? If so should the discussion about what to do in 3.5 take place in this issue (i.e. reopen for 3.5) or should it be a new issue? issue20499 would benefit from clarity about the statistics module policy for mixing types.
msg210694 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2014-02-08 22:21
Yes, a new RFE to sensibly handle mixed type input in 3.5 would make sense (I did something similar for the issue where we removed the special casing of Counter for 3.4).

History
Date	User	Action	Args
2022-04-11 14:57:57	admin	set	github: 64680
2014-02-08 22:21:40	ncoghlan	set	messages: + msg210694
2014-02-08 15:56:05	oscarbenjamin	set	messages: + msg210663
2014-02-08 10:19:08	ncoghlan	set	messages: + msg210611
2014-02-08 10:14:09	python-dev	set	status: open -> closed nosy: + python-dev messages: + msg210607 resolution: fixed stage: needs patch -> resolved
2014-02-08 09:49:14	ncoghlan	set	assignee: steven.daprano -> ncoghlan messages: + msg210604
2014-02-07 21:04:40	steven.daprano	set	files: + stats.patch messages: + msg210557
2014-02-04 16:45:06	wolma	set	files: + statistics.patch keywords: + patch messages: + msg210251
2014-02-04 16:32:06	steven.daprano	set	messages: + msg210247
2014-02-04 15:43:14	wolma	set	messages: + msg210243
2014-02-04 15:11:06	ncoghlan	set	messages: + msg210239
2014-02-04 14:46:13	oscarbenjamin	set	messages: + msg210236
2014-02-04 11:15:36	ncoghlan	set	messages: + msg210202
2014-02-03 15:47:59	wolma	set	messages: + msg210143
2014-02-03 15:13:49	oscarbenjamin	set	messages: + msg210137
2014-02-03 14:18:32	wolma	set	messages: + msg210127
2014-02-03 14:14:17	wolma	set	messages: + msg210126
2014-02-03 11:11:46	oscarbenjamin	set	messages: + msg210113
2014-02-02 22:04:17	wolma	set	messages: + msg210030
2014-02-02 12:33:38	oscarbenjamin	set	messages: + msg209976
2014-02-02 11:57:55	steven.daprano	set	assignee: steven.daprano
2014-02-02 07:45:08	wolma	set	files: + _sum_coerce.py messages: + msg209956
2014-02-02 01:31:51	ncoghlan	create