classification
Title: random.sample() behavior is unexpected/unclear from docs
Type: behavior Stage: resolved
Components: Documentation, Library (Lib) Versions: Python 3.6
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Scott Eilerman, docs@python, rhettinger, tim.peters
Priority: normal Keywords:

Created on 2018-03-21 14:52 by Scott Eilerman, last changed 2018-03-27 00:54 by rhettinger. This issue is now closed.

Messages (9)
msg314201 - (view) Author: Scott Eilerman (Scott Eilerman) Date: 2018-03-21 14:52
I ran into a "bug" when using random.sample() in which I got some results I didn't expect. After digging a little more, this is either a side effect of  the optimization that's made when k > 5, or I am using the function in a way that wasn't intended. If that's the case, I would recommend calling out this behavior in the documentation.

The crux of the issue is that, for a given seed, random.sample(choices,k) gives the same sequence of results for k=1 to k=5, but that sequence can be different (for the same seed) at k=6 and higher. From my initial testing this seems to only occur when 'choices' has an even length.

Example code to reproduce this issue:

import random
seed = 199
choices = range(-10,12)

for k in range(10):
    random.seed(seed)
    print(random.sample(choices,k))


Example code to look at many different occurrences of this issue:

import random
choices = range(-10,12)
count = 0
for seed in range(200):
    for k in range(8):
        random.seed(seed)
        seq1 = random.sample(choices, k)
        random.seed(seed)
        seq2 = random.sample(choices, k+1)
        if seq1 != seq2[:-1]:
            print(seed)
            print(seq1)
            print(seq2)
            count += 1
print(f'Number of bugged results: {count}/200')


To illustrate the odd/even issue, changing choices to range(-10,11) results in zero bugged results.
msg314202 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-03-21 15:17
What additional wording do you propose?
msg314204 - (view) Author: Scott Eilerman (Scott Eilerman) Date: 2018-03-21 15:22
Something along the lines of: "For a fixed seed, random.sample(population, k) is not guaranteed to return the same samples for different values of k."
msg314205 - (view) Author: Scott Eilerman (Scott Eilerman) Date: 2018-03-21 15:26
To clarify the use case where this behavior was problematic for me, I wanted to get the nth random draw from a given distribution, so I used something like:

random.seed(fixed_seed)
random.sample(choices, n)[-1]

Then, later, I want the next draw, so I did:

random.seed(fixed_seed)
random.sample(choices, n)[-1]


The workaround would be to use random.shuffle and pick the nth item, or use random.sample once with the highest k value I expect to need and then pick the nth element from that list. But I settled on the above implementation because for the first few cases I tested, it returned the results I expected, and nothing in the docs suggested I should expect otherwise.
msg314206 - (view) Author: Scott Eilerman (Scott Eilerman) Date: 2018-03-21 15:31
Sorry, there's a typo in that example. It should look like:


random.seed(fixed_seed)
random.sample(choices, n)[-1]

Then, later, I want the next draw, so I did:

random.seed(fixed_seed)
random.sample(choices, n+1)[-1]
msg314330 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-03-23 21:22
> Something along the lines of: "For a fixed seed, random.sample(population, k)
> is not guaranteed to return the same samples for different values of k."

In a way, the proposed wording succinctly directly addresses the problem you had.  So, it would seem like a reasonable suggestion.  On the other hand, it would be easy for others who haven't had this problem to have a hard time figuring out what it means (when should they be worried, what should be avoided, why is it a concern at all, what to do about it).

In general, the docs are worded in an affirmative manner (here's what something does, here's what it is for, and here is how to use it correctly).  In this case, the docs already indicate the intended way to address this use case: "the resulting list is in selection order so that all sub-slices will also be valid random samples.  This allows raffle winners (the sample) to be partitioned into grand prize and second place winners (the subslices)."

Perhaps there could be an algorithmic note, "internally, sample() shifts selection algorithms depending on the proportion of the population being sampled".  However, this would be unusual -- we don't usually document implementation details.  Numpy[1] and R[2] make no mention of the internals.  Julia[3] does discuss the algorithms but primarily from an efficiency point-of-view rather than as a usage note.

Perhaps it may be best to leave this alone rather than adding a note that may itself create confusion and worry.  AFAICT, this hasn't come up before in the 15 year history of random.sample(), not even a StackOverflow question.

[1] https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.choice.html
[2] https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/sample
[3] http://juliastats.github.io/StatsBase.jl/stable/sampling.html#Sampling-API-1
msg314397 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-03-25 06:05
There's nothing in the docs I can see that implies `sample(x, n)` is a prefix of what `sample(x, n+1)` would have returned had the latter been called instead.  If so, then - as always - it's "at your own risk" when you rely on behaviors that haven't been specified.  Since the universe of _possible_ behaviors that haven't been specified is immense, I'd rather we didn't even start to list all of them ;-)

Seriously, when the docs don't promise something, it's usually a mistake to assume they just forgot to mention it.  I'd leave the docs alone in this case.
msg314456 - (view) Author: Scott Eilerman (Scott Eilerman) Date: 2018-03-26 15:01
Raymond, Tim, thanks for your replies so far. I understand (and for the most part, agree with) your points about not being able to list every behavior, and not wanting to cause uncertainty in users. However, let me argue my case one more time, and if you still disagree, feel free to close this.

1. It is expected (in fact, one might argue it's the entire point) that initializing random.seed() with a fixed value will produce a repeatable set of results for a traditional random number generator. An user expects that calling the following should always produce the same sequence of numbers:

random.seed(22)
random.random()
random.random()
random.random()

2. Based on that behavior for one of the most typical/traditional functions in the random module, a naive user (me) might assume that random.sample() is drawing from its population in a similar manner (i.e. that sequence of returned items, regardless of how many you ask the function to return, is uniquely determined by the seed). While this is certainly an assumption...

2a. This assumption is somewhat validated by the introductory section of the random module docs, which states "Almost all module functions depend on the basic function random()..."

2b. More importantly, an user can "validate" this assumption by doing some simple tests, e.g.:

choices = range(100)
random.seed(22)
random.sample(choices,1)
random.seed(22)
random.sample(choices,2)
random.seed(22)
random.sample(choices,3)
... and so on

Because of the nature of the set/list optimization, it is VERY possible that an user could do due diligence in testing like this (a few different seeds, a few different sets of "choices", testing up to k=10) and never uncover the problematic behavior. You'd pretty much have to set up some loops like I did earlier in this thread, which I don't think many users would do unless the expect to find a problem. Even then, with certain selections of "choices", you might still get the "expected" results.

2c. If you suspected a problem, or really wanted to be sure the function does what you assume it will do, obviously you can open up random.py and take a look. However, I doubt many users do this for every built-in module and function they use; clearly the point of documentation is to avoid this scenario.

3. As Raymond mentioned, this does not appear to be a "common" problem, and perhaps that is enough to not add anything to the docs. However, due to the somewhat elusive nature of the behavior, it could certainly go undetected in many cases, potentially causing problems without anyone noticing. Perhaps I chose a very unorthodox implementation to get the results I desired; I easily could have used random.shuffle() or random.sample(pop, len(pop)) and picked the nth element. However, one could imagine cases in which you have a very large population and you want to optimize by using sample() to get the nth random draw rather than randomizing the entire list, so I don't think it's an entirely unjustified approach.

4. Given the above points, I'd argue that a one-line insertion into the docs would help users steer clear of a hard-to-anticipate, potentially costly pitfall. My suggested language is a more direct identification of the possible consequences, though I agree that it it perhaps too worry-inducing without specifying the "cause" of the problem. Raymond's algorithmic note may be a better choice and would have been enough of an indicator for me to avoid the mistake I made.
msg314487 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-03-27 00:54
>  However, let me argue my case one more time, and if you still disagree, feel free to close this.

Thank you for expressing your ideas so clearly.  However, I'm going to go with Tim's recommendation and leave the docs as is.

At this point, you already have a complete understanding of what the tool does and how to use it, so there isn't anything at stake other than a feeling a vindication for a bug that is already resolved.  Looking to the needs of other users, I think we're best off not even bringing up non-guaranteed implementation details and instead follow the example of Numpy and R.  Thanks again for the suggestion, but we'll respectfully decline this time around.
History
Date User Action Args
2018-03-27 00:54:27rhettingersetstatus: open -> closed
resolution: not a bug
messages: + msg314487

stage: resolved
2018-03-26 15:01:36Scott Eilermansetmessages: + msg314456
2018-03-25 06:05:09tim.peterssetnosy: + tim.peters
messages: + msg314397
2018-03-23 21:22:28rhettingersetmessages: + msg314330
2018-03-21 15:31:08Scott Eilermansetmessages: + msg314206
2018-03-21 15:26:54Scott Eilermansetmessages: + msg314205
2018-03-21 15:22:31Scott Eilermansetmessages: + msg314204
2018-03-21 15:17:33rhettingersetassignee: docs@python -> rhettinger

messages: + msg314202
nosy: + rhettinger
2018-03-21 14:52:57Scott Eilermancreate