Message 373771 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	mark.dickinson, oscarbenjamin, rhettinger, tim.peters
Date	2020-07-16.21:23:42
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1594934622.67.0.577315111626.issue41311@roundup.psfhosted.org>
In-reply-to

Content
Pro: focus on the "iterable" part of the title. If you want to, e.g., select 3 lines "at random" out of a multi-million-line text file, this kind of reservoir sampling allows to do that holding no more than one line in memory simultaneously. Materializing an iterable in a sequence is sometimes impossible. Yup, it takes O(number of elements) time, but so does anything else that has to look at every element of an iterable (including materializing an iterable in a sequence!). So this would make most sense as a different function. Con: over the years we've worked diligently to purge these algorithms of minor biases, leaving them as unbiased as the core generator. Introducing floating-point vagaries goes against that, and results may vary at times due to tiny differences in platform exp() and log() implementations. Worse, I'm not sure that the _approach_ is beyond reproach: the claim that skips "follow a geometric distribution" is baffling to me. If the reservoir has size K, then when we're looking at the N'th element it should be skipped with probability 1-K/N. But when looking at the N+1'th, that changes to 1-K/(N+1). Then 1-K/(N+2), and so on. These probabilities are all different. In an actual geometric distribution, the skip probability is a constant. Now 1-K/(N+i) is approximately equal to 1-K/(N+j) for i and j sufficiently close, and for K sufficiently smaller than N, so the skips may be well _approximated_ by a geometric distribution. But that's quite different than saying they _do_ follow a geometric distribution, and I see no reason to trust that the difference can't matter. The simple algorithm on the Wikipedia page suffers none of these potential problems (it implements an obviously-sound approach, & our `randrange()` is unbiased and platform-independent). But, for selecting 3 out of a million-element iterable, would invoke the core generator hundreds of thousands of times more often.

Pro: focus on the "iterable" part of the title. If you want to, e.g., select 3 lines "at random" out of a multi-million-line text file, this kind of reservoir sampling allows to do that holding no more than one line in memory simultaneously. Materializing an iterable in a sequence is sometimes impossible. Yup, it takes O(number of elements) time, but so does anything else that has to look at every element of an iterable (including materializing an iterable in a sequence!).

So this would make most sense as a different function.

Con: over the years we've worked diligently to purge these algorithms of minor biases, leaving them as unbiased as the core generator. Introducing floating-point vagaries goes against that, and results may vary at times due to tiny differences in platform exp() and log() implementations.

Worse, I'm not sure that the _approach_ is beyond reproach: the claim that skips "follow a geometric distribution" is baffling to me. If the reservoir has size K, then when we're looking at the N'th element it should be skipped with probability 1-K/N. But when looking at the N+1'th, that changes to 1-K/(N+1). Then 1-K/(N+2), and so on. These probabilities are all different. In an actual geometric distribution, the skip probability is a constant.

Now 1-K/(N+i) is approximately equal to 1-K/(N+j) for i and j sufficiently close, and for K sufficiently smaller than N, so the skips may be well _approximated_ by a geometric distribution. But that's quite different than saying they _do_ follow a geometric distribution, and I see no reason to trust that the difference can't matter.

The simple algorithm on the Wikipedia page suffers none of these potential problems (it implements an obviously-sound approach, & our `randrange()` is unbiased and platform-independent). But, for selecting 3 out of a million-element iterable, would invoke the core generator hundreds of thousands of times more often.

History
Date	User	Action	Args
2020-07-16 21:23:42	tim.peters	set	recipients: + tim.peters, rhettinger, mark.dickinson, oscarbenjamin
2020-07-16 21:23:42	tim.peters	set	messageid: <1594934622.67.0.577315111626.issue41311@roundup.psfhosted.org>
2020-07-16 21:23:42	tim.peters	link	issue41311 messages
2020-07-16 21:23:42	tim.peters	create