
Author p-ganssle
Recipients Guido.van.Rossum, Zac Hatfield-Dodds, gvanrossum, p-ganssle, terry.reedy
Date 2021-05-14.14:13:41
Message-id <1621001621.79.0.972438441052.issue42109@roundup.psfhosted.org>
In-reply-to
Content
@Terry

> The problem with random input tests is not that they are 'flakey', but that they are useless unless someone is going to pay attention to failures and try to find the cause.  This touches on the difference between regression testing and bug-finding tests.  CPython CI is the former, and marred at that by buggy randomly failing tests.

> My conclusion: bug testing would likely be a good idea, but should be done separate from the CI test suite.  Such testing should only be done for modules with an active maintainer who would welcome failure reports.

Are you saying that random-input tests are flaky, but that that is not the big problem? In my experience using hypothesis, you do not in practice end up with tests that fail at random. The majority of the time, if your code violates one of the properties, the tests fail the first time you run the test suite (this is particularly true for strategies where hypothesis deliberately makes it more likely that you'll get a "nasty" input by biasing the random selection algorithm in that direction). In a smaller number of cases, I see failures that happen on the second, third or fourth run.
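
To give a sense of what that bias looks like in practice (this is my own toy example, not a proposed stdlib test): a naive property over floats typically fails on the very first run, because `st.floats()` produces nan, infinities and extreme magnitudes early and often.

```python
from hypothesis import given, strategies as st

@given(st.floats())
def test_addition_increases_value(x):
    # Fails almost immediately: nan breaks the comparison, and very large
    # floats make x + 1 == x due to limited precision.
    assert x + 1 > x
```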

That said, if it is a concern that every run of the tests uses different inputs (and thus you might see a bug that only appears once in every 20 runs), it is possible to run hypothesis with a fixed seed or in "derandomized" mode, so that it always runs the same set of inputs for the same tests. We could then drop the fixed seed in a separate, non-CI hypothesis "fuzzing" run that exercises the test suite for longer (or indefinitely), looking for long-tail violations of these properties. A sketch of such a configuration follows.
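
Something along these lines (the profile names and example counts are made up, just to show the shape of it):

```python
from hypothesis import settings

# CI profile: derandomize=True makes hypothesis derive its examples
# deterministically from the test itself, so every CI run sees the
# same inputs for the same tests.
settings.register_profile("ci", derandomize=True, max_examples=100)

# Separate "fuzzing" profile for a non-CI job: fully random inputs,
# many more examples, no per-example deadline.
settings.register_profile("fuzz", max_examples=100_000, deadline=None)

# Loaded at import time in the test support code (or selected with
# pytest's --hypothesis-profile option).
settings.load_profile("ci")
```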

I feel that if we don't run at least some form of the hypothesis tests in CI, the tests will likely bit-rot and decay in usefulness. Consider the case where someone accidentally breaks an edge case so that `json.loads(json.dumps(o))` no longer works for some obscure value of `o` (a sketch of such a round-trip test is included below). With hypothesis tests running in CI, we are MUCH more likely to find this bug / regression during the initial PR that breaks the edge case than if we run the tests separately and report failures later. If we run the hypothesis tests in a buildbot, the process would be:

1. Contributor makes PR with passing CI.
2. Core dev review passes, PR is merged.
3. Buildbot run occurs and the buildbot watch is notified.
4. Buildbot maintainers track down the PR responsible and either file a new bug or comment on the old bug.
5. Someone makes a NEW PR adding a regression test and a fix for the regression introduced by the original PR.
6. Core dev review passes, second PR is merged.

If we run it in CI, the process would be:

1. Contributor makes PR, CI breaks.
2. If the contributor doesn't notice the broken CI, core dev points it out and it is fixed (or the PR is scrapped as unworkable).

Note that in the non-CI process, we need TWO core dev reviews, we need TWO PRs (people are not always super motivated to fix bugs that don't affect them, even ones they caused while fixing a bug that does), and we need time and effort from the buildbot maintainers (and the same applies even if the "buildbot" is actually a separate process run by Zac out of a GitHub repo).

Even if the bug only appears in one out of every four CI runs, it's highly likely that it will be found and fixed before it makes it into production, or at least found much more quickly, considering that most PRs go through a few edit cycles and a good fraction of them are backported to 2-3 branches, all with separate CI runs. It's a much quicker feedback loop.
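
For concreteness, the round-trip property mentioned above would look roughly like this as a hypothesis test (the strategy here is my own illustration, not a test that exists in the CPython suite):

```python
import json

from hypothesis import given, strategies as st

# JSON-representable values: None, bools, ints, finite floats, strings,
# and (recursively) lists and string-keyed dicts of those.
json_values = st.recursive(
    st.none()
    | st.booleans()
    | st.integers()
    | st.floats(allow_nan=False, allow_infinity=False)
    | st.text(),
    lambda children: st.lists(children)
    | st.dictionaries(st.text(), children),
)

@given(json_values)
def test_json_round_trip(o):
    assert json.loads(json.dumps(o)) == o
```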

I think there's an argument to be made that incorporating more third-party libraries (in general) into our CI build might cause headaches, but I think that is not a problem specific to hypothesis, and I think it's one where we can find a reasonable balance that allows us to use hypothesis in one form or another in the standard library.