Title: Augment random.choices() with the alias method
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.10
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: mark.dickinson, rhettinger, tim.peters
Priority: normal Keywords:

Created on 2020-06-26 20:38 by rhettinger, last changed 2020-07-31 01:34 by tim.peters. This issue is now closed.

File name Uploaded Description Edit rhettinger, 2020-06-27 08:08 Binomial variant rhettinger, 2020-06-27 23:42 Alias variant
Messages (6)
msg372441 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-06-26 20:38
For n unequal weights and k selections, sample selection with the inverse-cdf method is O(k logâ‚‚ n).  Using the alias method, it improves to O(k).  The proportionally constants also favor the alias method so that if the set up times were the same, the alias method would always win (even when n=2).

However, the set up times are not the same.  For the inverse-cdf method, set up is O(1) if cum_weights are given; otherwise, it is O(n) with a fast loop.  The setup time for the alias method is also O(n) but is proportionally much slower.

So, there would need to be a method selection heuristic based on the best trade-off between setup time and sample selection time.

Both methods make k calls to random().


Notes on the attached draft implementation:

* Needs to add back the error checking code.

* Need a better method selection heuristic.

* The alias table K defaults to the original index
  so that there is always a valid selection even
  if there are small rounding errors.

* The condition for the aliasing loop is designed
  to have an early-out when the remaining blocks
  all have equal weights.  Also, the loop condition
  makes sure that the pops never fail even if there
  are small rounding errors when partitioning
  oversized bins or if the sum of weights isn't
  exactly 1.0.
msg372443 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-06-26 20:57
I also looked at another method using binomial variates but couldn't get it to run faster than the alias method:

    def choices(population, weights, *, k=1):
        r = 1.0
        n = k
        selections = []
        for elem, p in zip(population, weights):
            v = binomial_variate(n, p / r)
            selections += [elem] * v
            n -= v
            r -= p
        return selections

The shuffle step took as much time as the alias method.  Also, the binomial variate was hard to compute quickly and without overflow/underflow issues for large inputs.
msg374622 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-07-30 20:27
I'm not clear on that the alias method is valuable here. Because of the preprocessing expense, it cries out for a class instead: an object that can retain the preprocessed tables, be saved to disk (pickled), restored later, and used repeatedly to make O(1) selections. I have such a class, which uses exact integer arithmetic (e.g., during preprocessing, there's at least one "too small" bin if and only if there's at least one "too large" bin), and is as unbiased as choice() and randrange(). Reasoning about float vagaries is much harder, and I see no error analysis here.

Most troubling: that due to rounding, reducing a "too large" bin results in a weight that spuriously compares <= bin_size. Then a "too small" bin with a genuinely tiny weight can be left with nothing to pair with, and so ends up aliasing itself, giving it a massively too-high probability of being picked.

So if you're determined to stick to "a function", I'd stick to the simple inverse CDF sampling method.  Unless the number of choices is "very" large, the log factor doesn't much matter.

That said, the alias numerics here could be improved some by working with the weights directly, "normalizing" them only at the end.  Instead of comparing weights to bin_size, compare them instead to the mean:

mean = total / n

(note: the next step to getting rid of floats entirely is to pre-multiply the weights by lcm(total, n) // total  - then their mean is exactly the integer lcm(total, n) // n).

The mean (instead of bin_size) suffers rounding errors then, but the original weights don't during the "hard part" of preprocessing.  At the end,

fn = n + 0.0
U = [weights[i] / total + i / fn for i in range(n)]

instead.  The "i / fn" also suffers just one rounding error as opposed to the two in the current "bin_width * i".  That can make a difference for n as small as 5:

>>> 3/5
>>> (1/5)*3
msg374624 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-07-30 23:40
Okay, thank you for looking!
msg374625 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-07-30 23:49
FWIW, I paid close attention to rounding.  The reason for the "while small and large" is that the loop stops making aliases unless we have both something at or below 0.5 and something known to be above 0.5.  If we end up with only two bins slightly above over below 0.5, they are considered close-enough and will alias to themselves.

By giving the bins a default alias to themselves, there is always a reasonable dispatch in the main sampling loop even if the random() value compares to an unfavorably rounded entry in U.

The code assumes that weird roundings will happen through out.
msg374627 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-07-31 01:34
Oh yes - I understood the intent of the code.  It's as good an approach to living with floating-point slop in this context as I've seen.  It's not obviously broken.  But neither is it obviously correct, and after a few minutes I didn't see a clear path to rigorously proving it can't break.  In FP, if there's one chance in 2**53 that it _can_ break, it will ;-)

The suggestions I made change none of that:  reducing the number of rounding errors can only help.

OTOH, an all-integer version of the algorithm _can_ be "obviously correct", and can embed asserts to verify that it is.  It can all be done in a single fixed-size list of 3-lists ([weight, value, alias]), swapping entries as needed to keep the 3 regions of interest ("too small", "on the nose", "too large") each contiguous.  Then it's straightforward to prove that you run out of "too small" entries at the same time you run out of "too large" ones.
Date User Action Args
2020-07-31 01:34:30tim.peterssetmessages: + msg374627
2020-07-30 23:49:21rhettingersetmessages: + msg374625
2020-07-30 23:40:09rhettingersetstatus: open -> closed
resolution: wont fix
messages: + msg374624

stage: resolved
2020-07-30 20:27:49tim.peterssetmessages: + msg374622
2020-06-27 23:42:12rhettingersetfiles: -
2020-06-27 23:42:01rhettingersetfiles: +
2020-06-27 08:08:02rhettingersetfiles: +
2020-06-27 08:07:43rhettingersetfiles: -
2020-06-27 03:53:32rhettingersetfiles: +
2020-06-26 20:57:17rhettingersetmessages: + msg372443
2020-06-26 20:38:41rhettingercreate