This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: randrange(N) for N's in the same dyadic block shows excessive correlations when sharing identical seeds
Stage: resolved
Versions: Python 3.8

process
Status: closed
Resolution: not a bug
Nosy List: jfbu, mark.dickinson, rhettinger, tim.peters
Priority: normal

Created on 2020-03-05 20:15 by jfbu, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
testrandrange.py (uploaded by jfbu, 2020-03-05 20:15)
Messages (9)
msg363451 - Author: jfbu (jfbu) * Date: 2020-03-05 20:15
We generate quadruples of random integers using randrange(n) and randrange(m) and count how many times the quadruples are identical, using the same random seed. Of course, for nearby n and m (the real-life example was with n==95 and m==97) we do expect matches. But we found orders of magnitude more matches than expected.

The attached file demonstrates this by comparison with random()*n (with rounding) as an alternative method of generating the random integers (we are aware this gives less uniformity for a given range, but these effects are completely negligible compared to the effect we test). For the latter, the probability of matches is non-vanishing but orders of magnitude smaller than with randrange(n).

Here is an excerpt of our test results. Each trial uses a random seed (selected via randrange(100000000)). Then 4 random integers in each of two given ranges are generated and compared. A hit is when all 4 match.
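For reference, a minimal sketch of the trial procedure just described (the attached testrandrange.py is the authoritative version; count_hits is a name invented here for illustration):

    import random

    def count_hits(n, m, trials=10000, k=4):
        # Count trials in which k draws from randrange(n) and k draws from
        # randrange(m) coincide when both streams start from the same seed.
        seeds = [random.randrange(100000000) for _ in range(trials)]
        hits = 0
        for seed in seeds:
            random.seed(seed)
            xs = [random.randrange(n) for _ in range(k)]
            random.seed(seed)
            ys = [random.randrange(m) for _ in range(k)]
            if xs == ys:
                hits += 1
        return hits

    print(count_hits(99, 127))  # thousands of hits
    print(count_hits(99, 128))  # none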

- with randrange():

n = 99, m = 124, 4135 hits among 10000 trials
n = 99, m = 125, 3804 hits among 10000 trials
n = 99, m = 126, 3803 hits among 10000 trials
n = 99, m = 127, 3892 hits among 10000 trials
n = 99, m = 128, 0 hits among 10000 trials
n = 99, m = 129, 0 hits among 10000 trials
n = 99, m = 130, 0 hits among 10000 trials
n = 99, m = 131, 0 hits among 10000 trials

- with random():

n = 99, m = 124, 0 hits among 10000 trials
n = 99, m = 125, 0 hits among 10000 trials
n = 99, m = 126, 0 hits among 10000 trials
n = 99, m = 127, 0 hits among 10000 trials
n = 99, m = 128, 0 hits among 10000 trials
n = 99, m = 129, 0 hits among 10000 trials
n = 99, m = 130, 0 hits among 10000 trials
n = 99, m = 131, 0 hits among 10000 trials

The test file has some hard-coded random seeds for reproducibility.

Although I did only limited testing, it is flagrant that there is a completely abnormal correlation between randrange(n) and randrange(m) when the two integers have the same bit length.

Tested with 3.6 and 3.8.
msg363452 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2020-03-05 20:29
These results are hardly surprising, given the algorithms used. But why would you consider this a problem?

If you want your randrange(n) and randrange(m) results to be independent, don't generate them from the exact same random source, starting with the same seed. Use different seeds, or generate one set of results followed by the other without resetting the seed in between.
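For instance, a minimal sketch of the second suggestion (the specific numbers are invented here):

    import random

    random.seed(42)                                 # seed once only
    xs = [random.randrange(99) for _ in range(4)]   # first set of results
    ys = [random.randrange(127) for _ in range(4)]  # same stream, no reseed:
                                                    # effectively independent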
msg363454 - Author: jfbu (jfbu) * Date: 2020-03-05 21:18
Yes, indeed the source code of _randbelow_with_getrandbits generates this effect. And I understand the puzzlement about my test file setting the same random seed and then complaining about correlations. But the discrepancy between what happens for, say, n=99 and m=127 versus n=99 and m=128 is enormous.
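For readers without the source at hand, the following sketch mirrors the logic of _randbelow_with_getrandbits in Lib/random.py (paraphrased, not the exact code). Since n=99 and m=127 share bit length 7, both consume the identical getrandbits(7) stream and agree on every aligned draw that lands below 99, while m=128 switches to 8-bit draws and the streams diverge immediately:

    def randbelow(n, getrandbits):
        # Simplified sketch of random.Random._randbelow_with_getrandbits.
        k = n.bit_length()    # number of bits drawn depends only on bit length
        r = getrandbits(k)
        while r >= n:         # rejection: redraw until the value fits
            r = getrandbits(k)
        return r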

Now that I have looked at the actual source code: from other programming contexts I have available methods of reducing from a power-of-2 range to an arbitrary range which should not show this artefact.  I will need to translate them into Python and may submit a PR for evaluation after testing.

My test file illustrates that if randrange() proceeded from a pseudo-random variable emulating the uniform distribution on (0,1), rescaled and rounded to an integer range, the correlations between distinct ranges would be of a completely different nature than those the CPython method leads to.
msg363459 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-03-05 22:25
Sorry, I don't see "a problem" here either.  Rounding instead can change the precise nature of the correlations if you insist on starting from the same seed, but it hardly seems a real improvement; e.g.,

>>> random.seed(12)
>>> [round(random.random() * 100) for i in range(10)]
[47, 66, 67, 14, 1, 37, 27, 81, 69, 60]
>>> random.seed(12)
>>> [round(random.random() * 101) for i in range(10)]
[48, 66, 67, 14, 1, 38, 28, 82, 70, 61]

That is, while there are fewer identical values, the correlation is nevertheless obvious and extreme.  Not only not a bug, it's not even surprising ;-)
msg363461 - Author: jfbu (jfbu) * Date: 2020-03-05 23:00
@tim.peters yes, a uniform random variable rescaled to two nearby scales N and M will display strong correlations. CPython's randrange(), however, exhibits such correlations orders of magnitude more strongly, and only between ranges sharing a common bit length. A randrange() function should a priori not be so strongly tied to the binary base.

The example you show would not be counted as a hit by my test for the random seed 12.

>>> s = 0
>>> for t in range(100000):
...     random.seed(t)
...     x = [round(random.random() * 100) for i in range(10)]
...     random.seed(t)
...     y = [round(random.random() * 101) for i in range(10)]
...     if x == y:
...         s += 1
... 
>>> s
94
>>> s = 0
>>> for t in range(100000):
...     random.seed(t)
...     x = [random.randrange(100) for i in range(10)]
...     random.seed(t)
...     y = [random.randrange(101) for i in range(10)]
...     if x == y:
...         s += 1
... 
>>> s
90432
msg363471 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-03-06 01:54
This is where you're not getting traction:

"A randrange() function should a priori not be so strongly tied to the binary base."

That's a raw assertion.  _Why_ shouldn't it be?  "Because I keep saying so" isn't changing minds ;-)

I understand you're looking at exact equality of t-tuples.  I wasn't in my example:  I was looking at the individual values, one pair at a time.  The extreme correlation is dead obvious by eyeball either way, even though the only test you seem to have in mind (exact equality of t-tuples) is blind to it.  Why is that test so important?  Why does it not matter that, e.g., number of inversions, number of runs, distribution of run-lengths (etc.) remain highly correlated regardless?
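To make that concrete with a hypothetical check (not from the thread): with the rounding scheme and shared seeds, elementwise near-equality holds in every single trial even though tuple equality is rare, because round(u*101) and round(u*100) can differ by at most 1 for u < 1:

    import random

    close = 0
    for t in range(10000):
        random.seed(t)
        x = [round(random.random() * 100) for i in range(10)]
        random.seed(t)
        y = [round(random.random() * 101) for i in range(10)]
        # u*101 - u*100 = u < 1, so the two rounds differ by 0 or 1
        if all(abs(a - b) <= 1 for a, b in zip(x, y)):
            close += 1
    print(close)  # 10000: every trial is elementwise within 1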

Nobody else has had a problem with this, and it remains unclear why you do:  what's your objection to Mark's suggestions (use different seeds, or _don't_ reset the seed)?  That's the obvious approach:  use the facilities in straightforward ways.

In any case, we can't/won't make changes on a whim.  As far as possible, we strive to keep results bit-for-bit identical across releases for people who save/set seeds, hoping to get reproducible results.  Changing the results from any random module function requires strong justification.

So far, I don't see that here.
msg363482 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-03-06 05:13
I concur with Tim and Mark.  This isn't a bug.  The randomization occurs at the getrandbits() level.  The downstream functions, such as randbelow and randrange, have no responsibility to create additional dispersion; instead, their role is to map the upstream scattering to the desired target distributions.

The correlations found by the OP are indeed unsurprising.  They disappear entirely when using two different seeds or making successive selections from a single stream.
msg363491 - Author: jfbu (jfbu) * Date: 2020-03-06 07:23
"bug" is a strong word, which I did never employ myself, apart from using this channel of report. I rather think of a (non-documented) "deficiency", but I expect the consensus will again be that I am expressing a "raw expression". However reading more than once that "the correlations are indeed unsurprising" it is my turn to see there a raw expression. The correlations are unsurprising *only* if one looks at the source code and understand how a (to a very high degree) uniform distribution on a power of 2 range is reduced to distribution on the smaller range (keeping the extremely high uniformity). Thus, sorry, the correlations are to the contrary *very surprising* to the end user who has no knowledge of the internals.

Donald Knuth, for example, many decades ago in his work on MetaPost used an RNG which is a kind of primitive ancestor (in the family of those commented upon in AOCP) of the much more sophisticated one used nowadays by Python. He used the rescaling-with-rounding method to go from a power-of-2 range to a non-power-of-2 range. That method induces some non-uniformity, and if I understand correctly (without having checked) it was the one used by Python < 3.2. At some point Python changed to another method, to cure the non-uniformity (and other artefacts of the simple-minded rescaling method). But there are other ways to reduce the non-uniformity to negligible levels which are not the choice made currently in Python's code (which, for people not having _randbelow_with_getrandbits before their eyes, is simply to draw random integers with at most the number of bits of the limit N until one strictly below N is found).
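For concreteness, a minimal sketch of the rescaling reduction described here (the function name is invented; this paraphrases the pre-3.2 behavior rather than quoting it):

    import random

    def randrange_by_scaling(n):
        # Rescale a uniform float in [0.0, 1.0) to range(n) by truncation.
        # Slightly biased: random() carries only 53 bits, and those 2**53
        # outcomes cannot split evenly across n buckets unless n is a
        # power of 2.
        return int(random.random() * n)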
msg363572 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-03-07 02:10
I've been involved - usually very lightly, but sometimes deeply - with PRNGs for over 40 years.  I've never seen anyone raise the kinds of concerns you're raising here, and, frankly, they don't make sense to me.

But on the chance that I've missed recent fundamental developments, I did some searching.  The state of the art for uniform selection from a range appears to be what the Go language recently adopted, shown as Algorithm 5 in this paper (which also surveys what other languages do):

https://arxiv.org/pdf/1805.10941.pdf

Nobody gives a rip about "correlations" when reusing an internal state.  What they're aiming at is a combination of "no bias" and "peak speed".

Python's job is harder, because we couldn't care less what the native machine word size is.  For example, randrange(10**500) works fine.  Approaches based on chopping back multiplication with a "random" C double can't get more than 53 bits (on most machines), the limit of IEEE-754 format double precision.  Python dropped that approach not only because it was slightly biased, but also because it didn't scale to unbounded ranges.

The Go approach works like this Python code, where NBITS would be hard coded in C to 32 or 64, depending on the machine word size.  Its advantage is that it _usually_ needs no arbitrary divisions (implied by "%") at all, just masking and shifting.  At worst, it _may_ need one for-real division.  But it's built to work with native machine integer precision, and requires full double-width integer multiplication:

    from random import randrange
    NBITS = 64 # need NBITS x NBITS full-precision multiply
    POWER = 1 << NBITS
    MASK = POWER - 1

    def rr(s):  # uniform random int in range(s)
        assert 0 < s <= POWER
        x = randrange(POWER) # i.e., a random bitstring with NBITS bits
        m = x * s # full double-width product
        r = m & MASK
        if r < s:
            t = (POWER - s) % s  # the expensive line
            while r < t:
                x = randrange(POWER)
                m = x * s
                r = m & MASK
        return m >> NBITS

In a nutshell, in each [i * POWER, (i+1) * POWER) clopen interval, it rejects the first POWER % s values, leaving exactly floor(POWER/s) multiples of s in the interval.
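Two side notes, not from the original message: t = (POWER - s) % s equals POWER % s (the subtracted form is used because POWER itself overflows a machine word in C), and a quick hypothetical sanity check of rr() above:

    from collections import Counter

    # Draw many values and confirm they spread evenly over range(6).
    counts = Counter(rr(6) for i in range(600000))
    print(sorted(counts.items()))  # each of 0..5 appears near 100000 times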

Of course that suffers "correlations" too if you reuse the seed - although not of the exact microscopic kind you're talking about.

You haven't responded to questions about why that specific test is thought to be especially important, or to why you believe you can't do an obvious thing instead (use different seeds, or don't reset the seed at all).

As to who's making raw assertions here, the situation isn't symmetric ;-)  If you want a change, you have to _make a case_ for it.  And a good one.  Every core Python developer who has knowledge of this code has now told you they personally don't see any case for it.  So, sorry, but the status quo is always favored in the absence of compelling reason to change.  There were compelling reasons to switch to the current scheme.  As you said yourself:

"""
And I understand the puzzlement about my test file setting the same random seed and then complaining about correlations. 
"""

That was insightful.  The difference is we never got over that puzzlement ;-)
History
Date                 User            Action  Args
2022-04-11 14:59:27  admin           set     github: 84048
2020-03-07 02:10:21  tim.peters      set     messages: + msg363572
2020-03-06 07:23:45  jfbu            set     messages: + msg363491
2020-03-06 05:13:47  rhettinger      set     status: open -> closed
                                             resolution: not a bug
                                             messages: + msg363482
                                             stage: resolved
2020-03-06 01:54:00  tim.peters      set     messages: + msg363471
2020-03-05 23:00:34  jfbu            set     messages: + msg363461
2020-03-05 22:25:06  tim.peters      set     nosy: + tim.peters
                                             messages: + msg363459
2020-03-05 21:18:11  jfbu            set     messages: + msg363454
2020-03-05 20:29:33  mark.dickinson  set     nosy: + rhettinger, mark.dickinson
                                             messages: + msg363452
2020-03-05 20:15:39  jfbu            create