Author rhettinger
Recipients Christian.Kleineidam, aisaac, dkorchem, madison.may, mark.dickinson, neil.g, pitrou, rhettinger, serhiy.storchaka, tim.peters, westley.martinez, xksteven
Date 2016-09-06.08:03:46
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1473149027.0.0.209747848588.issue18844@psf.upfronthosting.co.za>
In-reply-to
Content
Latest draft patch attached (w/o tests or docs).
Incorporates consultation from Alan Downey and Jake Vanderplas.

* Population and weights are separate arguments (like numpy.random.choice() and sample() in R).  Matches the way data would arrive in Pandas.  Easily extracted from a Counter or dict using keys() and values().  Suitable for applications that sample the population multiple times but using different weights.  See https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html and http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html

* Excludes a replacement=False option. That use case necessarily has integer weights and may be better suited to the existing random.sample() rather than trying to recompute a CDF on every iteration as we would have to in this function.

* Allows cumulative_weights to be submitted instead of individual weights.  This supports uses cases where the CDF already exists (as in the ThinkBayes examples) and where we want to periodically reuse the same CDF for repeated samples of the same population -- this occurs in resampling applications, Gibbs sampling, and Monte Carlo Markov Chain applications.  Per Jake, "MCMC/Gibbs Sampling approaches generally boil down to a simple weighted coin toss at each step" and "It's definitely common to do aggregation of multiple samples, e.g. to compute sample statistics"

* The API allows the weights to be integers, fractions, decimals, or floats.  Likewise, the population and weights can be any Sequence.  Population elements need not be hashable.

* Returning a list means that the we don't have to save state in mid-stream (that is why we can't use a generator).  A list feeds nicely into Counters, mean, median, stdev, etc for summary statistics.  Returning a list parallels what random.sample() does, keeping the module internally consistent.

* Default uniform weighting falls back to random.choice() which would be more efficient than bisecting.

* Bisecting tends to beat other approaches in the general case.  See http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python

* Incorporates error checks for len(population)==len(cum_weights) and for conflicting specification of both weights and cumulative weights.

There API is not perfect and there are some aspects that give me heartburn.  1) Not saving the computed CDF is waste and forces the user to pre-build the CDF if they want to save it for later use (the API could return both the selections and the CDF but that would be awkward and atypical).  2) For the common case of having small integer weights on a small population, the bisecting approach is slower than using random.choice on a population expanded to include the selections multiple times in proportion to their weights (that said, short of passing in a flag, there is no cheap easy way for this function to detect that case and give it a fast path).  3) Outputting a list is inefficient if all you're doing with result is summarizing it with a Counter, histogram tool, mean, median, or stdev.  4)  There is no cheap way to check to see if the user supplied cum_weights is sorted or if the weights contain negative values.
History
Date User Action Args
2016-09-06 08:03:47rhettingersetrecipients: + rhettinger, tim.peters, mark.dickinson, pitrou, aisaac, westley.martinez, serhiy.storchaka, neil.g, madison.may, dkorchem, Christian.Kleineidam, xksteven
2016-09-06 08:03:47rhettingersetmessageid: <1473149027.0.0.209747848588.issue18844@psf.upfronthosting.co.za>
2016-09-06 08:03:46rhettingerlinkissue18844 messages
2016-09-06 08:03:46rhettingercreate