
Author oscarbenjamin
Recipients mark.dickinson, oscarbenjamin, rhettinger
Date 2013-09-30.10:27:12
Message-id <CAHVvXxQgvQD5kazHXweWgT3LnDjAUDSPRMgfdfg4aQC=MhgdAQ@mail.gmail.com>
In-reply-to <1380520325.81.0.484241471973.issue19086@psf.upfronthosting.co.za>
Content
I should be clearer about my intentions. I'm hoping to create an
efficient and accurate sum() function for the new stdlib statistics
module:
http://bugs.python.org/issue18606
http://www.python.org/dev/peps/pep-0450/

The sum() function currently proposed can be seen in this patch:
http://bugs.python.org/file31680/statistics_combined.patch

It uses Fractions to compute an exact sum and separately tries to keep
track of type coercion to convert the end result. It is accurate and
returns a result of the appropriate type for any mix of ints,
Fractions, Decimals and floats with the caveat that mixing Fractions
and Decimals is disallowed. I believe the most common use-cases will
be for ints/floats and for these cases it is 100x slower than
sum/fsum.
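
To make the approach concrete, here is a stripped-down sketch
(hypothetical and much simplified; the real function in the patch
tracks coercion more carefully and explicitly rejects the
Fraction/Decimal mix, and the name exact_sum is just for
illustration):

    from fractions import Fraction
    from decimal import Decimal

    def exact_sum(data):
        # Accumulate exactly: Fraction() is lossless for int, float,
        # Fraction and Decimal inputs.
        total = Fraction(0)
        types = set()
        for x in data:
            types.add(type(x))
            total += Fraction(x)
        if types <= {int}:
            return int(total)
        if Decimal in types:
            # Decimal() won't accept a Fraction directly; divide
            # instead (rounded to the current context precision).
            return Decimal(total.numerator) / Decimal(total.denominator)
        if float in types:
            return float(total)
        return total

With this, exact_sum([0.1] * 10) returns exactly 1.0 where the
builtin sum() returns 0.9999999999999999.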

The discussion here:
http://bugs.python.org/issue18606#msg195630
resulted in the sum function that you can see in the above patch.
Following that I've had off-list discussions with the author of the
module about improving the speed of the sum function (which is a
bottleneck for half of the functions exposed by the module). He is
very much interested in speed optimisations but doesn't want to
compromise on accuracy.

My own interest is that I would like an efficient and accurate (for
all types) sum function *somewhere* in the stdlib. The addition of the
statistics module with its new sum() function is a good opportunity to
achieve this. If this were purely for myself I wouldn't bother this
much with speed optimisation since I've probably already spent more of
my own time thinking about this function than I ever would running it!
(If I did want to specifically speed this up in my own work I would,
as you say, use Cython).

If fsum() were modified in the way that I describe, it would be
usable as a primitive for the statistics.sum() function and also for
parallel computation and so on. I agree that the carry argument is
unnecessary and that the main point is just the exactness of the
return value: it seems a shame that fsum could so easily give me an
exact result but doesn't.
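
Concretely, I mean something along the lines of the pure-Python msum
recipe that fsum's algorithm comes from, stopping just short of the
final rounding step (a sketch, not a patch):

    def msum_partials(iterable):
        # Shewchuk-style summation as in the msum recipe: maintain a
        # list of non-overlapping partial sums in increasing order of
        # magnitude.
        partials = []
        for x in iterable:
            i = 0
            for y in partials:
                if abs(x) < abs(y):
                    x, y = y, x
                hi = x + y
                lo = y - (hi - x)
                if lo:
                    partials[i] = lo
                    i += 1
                x = hi
            partials[i:] = [x]
        # fsum would now round this list to a single float; instead
        # return it, since its exact sum is the exact sum of the
        # inputs.
        return partials

fsum(xs) is then just the correctly rounded sum of
msum_partials(xs).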

As for exposing the internals of fsum, is it not the case that a
*non-overlapping* set of floats having a given exact sum is a uniquely
defined set? For msum I believe it is only necessary to strip out
zeros in order to normalise the output so that it is uniquely defined
by the true value of the sum in question (the list is already in
ascending order once the zeros are removed).
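
For example, with the msum_partials sketch above (normalised is
again just an illustrative name):

    from fractions import Fraction

    def normalised(partials):
        # Strip zeros; what remains is an ascending list of
        # non-overlapping floats.
        return [p for p in partials if p]

    parts = normalised(msum_partials([1e100, 0.1, -1e100, 0.1]))
    exact = sum(map(Fraction, parts))  # lossless reconstruction

Here the builtin sum() loses one of the 0.1s and returns 0.1, while
the stripped partials are [0.2], whose Fraction sum is the exact
result.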

Uniqueness: suppose a many-digit value is something like
(+-)abcdefghijkl * 2**x and we want to break it into non-overlapping
2-digit floats of the form ab * 2**x. Since the first digit of each
2-digit float must be non-zero (treating denormals as a separate
case), the largest-magnitude float is uniquely determined. Subtract
that from the many-digit total and the next largest float is
uniquely determined, and so on. There can be at most one denormal
number in the result, and it too is uniquely determined once the
larger numbers have been extracted. So the only way that the
partials list can be non-unique is by the inclusion of zeros (unless
I've missed something :).
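
A base-10 toy version of that greedy argument (hypothetical, just to
make the step concrete):

    def two_digit_chunks(n):
        # Break a non-negative many-digit integer into
        # non-overlapping 2-digit "floats" ab * 10**k with a non-zero
        # leading digit.  At each step the largest-magnitude chunk is
        # forced, so the decomposition is unique.
        assert n >= 0
        out = []
        while n:
            k = max(len(str(n)) - 2, 0)
            hi = (n // 10**k) * 10**k
            out.append(hi)
            n -= hi
        return out

    >>> two_digit_chunks(123456)
    [120000, 3400, 56]

Each chunk is forced by the leading digits of what remains, which is
why (zeros aside) I expect the binary partials to be unique as well.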