I have written a patch for this issue. I'm uploading the complete new code for everyone to try; importing it into Python 3.3 works fine. A diff with additional tests against Oscar's example will follow soon.
Just as Oscar suggested, this new version performs all calculations using exact rational arithmetic and rounds/coerces only before returning the final result to the user. Its precision is thus limited only by that of the input data sequence.
It passes Oscar's examples 1-3, as you can easily test yourself. It also gives the correct answer in the fourth example, mean([D('1.2'), D('1.3'), D('1.55')]), although on my system the original statistics module already gets this one right.
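For illustration only (this is not the code from the uploaded file): a minimal sketch of the "compute exactly, coerce once at the end" idea, using fractions.Fraction from the standard library. The function name exact_mean is hypothetical, and returning a float is an assumption made here for brevity; the actual patch coerces according to the input types.

```python
from fractions import Fraction


def exact_mean(data):
    """Hypothetical helper: mean in exact rational arithmetic, rounded once."""
    values = list(data)
    if not values:
        raise ValueError("exact_mean requires at least one data point")
    # Fraction(x) is exact for int, float, Decimal and Fraction inputs,
    # so no rounding happens while the sum is accumulated.
    total = sum((Fraction(x) for x in values), Fraction(0))
    # The only rounding step: convert the exact result back to float.
    return float(total / len(values))


data = [1e100, 1.0, -1e100]
print(sum(data) / len(data))  # naive float arithmetic loses the 1.0
print(exact_mean(data))       # exact arithmetic keeps it
```

With plain floats the 1.0 is absorbed by 1e100 and the naive mean comes out as 0.0, whereas the exact version returns the float closest to 1/3.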
The implementation I chose for this is a bit different from Oscar's suggestion. Essentially, it introduces a dedicated module-private class, _ExactRatio, that represents numbers as exact ratios and gets passed between different functions in the module. This class borrows many of its algorithms from fractions.Fraction, but has some specialized methods for central tasks in the statistics module, which make it much more efficient in this context than fractions.Fraction. The class is currently really minimal, but can easily be extended if necessary.
In my implementation this new class is used throughout the module whenever calculations with or conversions to exact ratios have to be performed, which allowed me to preserve almost all of the original code and to factor out the changes to the class.
As for performance, the gain imagined by Oscar is not always realized even though the variance functions are now using single passes over the data. Specifically, in the case of floats the overhead of having to convert everything to exact ratios first eats up all the savings.
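To make the single-pass idea concrete, here is a rough sketch using fractions.Fraction rather than the _ExactRatio class from the patch. The name exact_variance and the sum / sum-of-squares formula are assumptions chosen for illustration; the uploaded code may organize the pass differently. With exact ratios this formula is safe, because the cancellation problems it has in floating point cannot occur.

```python
from fractions import Fraction


def exact_variance(data):
    """Hypothetical helper: sample variance from one pass over the data."""
    n = 0
    total = Fraction(0)     # running sum of x
    total_sq = Fraction(0)  # running sum of x**2
    for x in data:
        r = Fraction(x)     # exact conversion; this is the per-item overhead
        n += 1
        total += r
        total_sq += r * r
    if n < 2:
        raise ValueError("exact_variance requires at least two data points")
    # sum((x - mean)**2) == sum(x**2) - sum(x)**2 / n holds exactly here.
    return (total_sq - total * total / n) / (n - 1)


print(exact_variance([1, 2, 3, 4]))        # exact result as a Fraction
print(exact_variance(iter([1, 2, 3, 4])))  # also works on a one-shot iterator
```

The conversion inside the loop is where the extra cost for float input comes from: every item has to be turned into a ratio of two integers before it can be accumulated.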
In the case of fractional input, there is a dramatic performance boost though. I compiled a small table comparing (kind of) average performance of the two versions with various input data types. Take this with a grain of salt because the differences can vary quite a bit depending on the exact data:
data type        performance gain (+) / loss (-) over original module
---------        -----------------------------------------------------
float            - 10 %
short Decimal    + 10 %
long Decimal     - 25 %
Fraction         + 80 % (!!)
MyFloat          + 25 %
With Decimal input, the cost of conversion to exact ratios depends on the number of digits in the Decimals, so with short Decimals the savings from the single-pass algorithm are larger than the conversion costs, but this reverses for very long Decimals.
MyFloat is a minimal class inheriting from float and overriding just its arithmetic methods to return MyFloat instances again.
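The exact MyFloat used for the timings is not shown here; a class along the following lines would match the description. The set of overridden operators and the _wrap helper are assumptions made for this sketch.

```python
def _wrap(name):
    """Build a method that calls float's version and re-wraps the result."""
    base = getattr(float, name)

    def method(self, other):
        result = base(self, other)
        # Let Python fall back to the other operand if float cannot handle it.
        if result is NotImplemented:
            return result
        return MyFloat(result)

    method.__name__ = name
    return method


class MyFloat(float):
    """float subclass whose binary arithmetic returns MyFloat again."""


# Only the basic binary operators are overridden in this sketch.
for _name in ("__add__", "__radd__", "__sub__", "__rsub__",
              "__mul__", "__rmul__", "__truediv__", "__rtruediv__"):
    setattr(MyFloat, _name, _wrap(_name))

x = MyFloat(1.5)
print(type(x + 2).__name__, type(2 * x).__name__, type(x / 4).__name__)
```

A type like this is useful as a test case because it behaves numerically like float but is not float itself, so type-specific fast paths cannot be assumed to apply to it.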
The performance gain with Fraction input comes from two changes: the single-pass algorithm and an optimization in _sum. With Fraction input, more than with any other type, the dictionary built by _sum can grow quite large, and the optimization lies in how the dictionary elements are converted to exact ratios. This is why the gain can sometimes be significantly higher than the 80 % listed in the table.
Try this, for example:
from statistics import variance as v
from statistics_with_exact_ratios import variance as v2
from fractions import Fraction
data = [Fraction(1,x) for x in range(1,2000)]
print('calculating variance using original statistics module ...')
print(float(v(data)))
print('now using exact ratio calculations ...')
print(float(v2(data)))
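As a rough illustration of why the dictionary in _sum grows with this kind of data: the sketch below assumes that the dictionary maps each denominator to the sum of the numerators seen with it. That layout is an assumption about the module's internals, and partial_sums_by_denominator is a hypothetical stand-in, not the module's code.

```python
from fractions import Fraction


def partial_sums_by_denominator(data):
    """Hypothetical stand-in: {denominator: sum of numerators}."""
    partials = {}
    for x in data:
        r = Fraction(x)  # exact ratio of the input value
        partials[r.denominator] = partials.get(r.denominator, 0) + r.numerator
    return partials


fraction_data = [Fraction(1, x) for x in range(1, 2000)]
float_data = [1 / x for x in range(1, 2000)]

# Every Fraction here has its own denominator, so the dict gets one entry each.
print(len(partial_sums_by_denominator(fraction_data)))
# Float denominators are powers of two, so far fewer distinct keys appear.
print(len(partial_sums_by_denominator(float_data)))
```

Under that assumption, the final step has to combine one exact ratio per dictionary entry, which is why the way those entries are converted matters so much for Fraction input.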
I invite everybody to test my implementation, which is very unlikely to be free of bugs at this stage.