
Author p-ganssle
Recipients Guido.van.Rossum, Zac Hatfield-Dodds, gvanrossum, p-ganssle, rhettinger, terry.reedy
Date 2021-05-25.16:40:02
Content
> I use hypothesis during development, but don't have a need for it in the standard library.  By the time code lands there, we normally have a specific idea of what edge cases need to be in the tests.

The suggestion I've made here is that we use @example decorators to take the hypothesis tests you would have written anyway and turn them into what are essentially parameterized tests. For anyone who doesn't want to explicitly run the hypothesis test suite, the tests you are apparently already writing would simply become normal tests covering just the edge cases.
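
To make that concrete, here's a minimal sketch of the pattern (a hypothetical test, not from any actual patch): the @example decorators pin the edge cases, and in "stubs" mode those pinned inputs would be the only ones that run:

```python
from hypothesis import example, given
from hypothesis import strategies as st


@given(st.integers())  # hypothesis generates arbitrary inputs...
@example(0)            # ...while these pinned edge cases always run,
@example(-1)           # even when hypothesis itself is unavailable
@example(2**31 - 1)
def test_abs_is_idempotent(x):
    assert abs(abs(x)) == abs(x)
```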

One added benefit of keeping the tests around in the form of property tests is that you can run those same tests through hypothesis to find regressions in bugfixes implemented after landing (e.g. "oh, we can add a fast path here", which introduces a new edge case). The segfault from bpo-34454, for example, would have been found if I had been able to carry the hypothesis-based tests I was using during the initial implementation of fromisoformat over into later stages of development. (I'm not sure whether I missed it because I didn't run the tests long enough to hit that particular edge case, or because the edge case didn't exist until after I had moved development into the CPython repo.)
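
For reference, the tests in question were roughly of this shape (a sketch from memory, not the original code):

```python
from datetime import datetime

from hypothesis import given
from hypothesis import strategies as st


@given(st.datetimes())
def test_fromisoformat_roundtrip(dt):
    # fromisoformat() is documented to invert isoformat()
    assert datetime.fromisoformat(dt.isoformat()) == dt
```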

Another benefit of keeping them around is that they become fuzz targets, meaning people like oss-fuzz or anyone who wants to throw some fuzzing resources at CPython have an existing body of tests that are expected to pass on *any* input, to find especially obscure bugs.
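
Hypothesis already exposes a hook for this: every @given-decorated test carries a .hypothesis.fuzz_one_input attribute that external fuzzers can drive. A sketch using Atheris as the engine (the property itself is just a stand-in):

```python
import sys

import atheris  # Google's Python fuzzing engine; any engine with a
                # "test one input" entry point would work the same way
from hypothesis import given
from hypothesis import strategies as st


@given(st.text())
def test_utf8_roundtrip(s):
    assert s.encode("utf-8").decode("utf-8") == s


# fuzz_one_input accepts a byte buffer and uses it as the source of
# randomness for the strategy, letting the fuzzer steer the input space.
atheris.Setup(sys.argv, test_utf8_roundtrip.hypothesis.fuzz_one_input)
atheris.Fuzz()
```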

> For the most part, hypothesis has not turned up anything useful for the standard library.  Most of the reports that we've gotten reflected a misunderstanding by the person running hypothesis rather than an actual bug. [...]

I don't think "it hasn't turned up useful bugs" is a strong argument here. Most of the bugs in a piece of code will be found during development or in the early stages of adoption, and the standard library has very wide adoption. I've found a number of bugs in zoneinfo using hypothesis tests, and I'd love to keep using them in CPython rather than throwing them away or maintaining them in a separate repo.

I also think it is very useful for us to write tests about the properties of our systems for re-use by PyPy (which does use hypothesis, by the way) and other implementations of Python. This kind of "define the contract and maintain tests that enforce it" approach is very helpful for alternate implementations.

> For numeric code, hypothesis is difficult to use and requires many restrictions on the bounds of variables and error tolerances.  [...]

I do not think that we need to make hypothesis tests mandatory. They can be used when someone finds them useful.

> The main area where hypothesis seems easy to use and gives some comfort is in simple roundtrips:  assert zlib.decompress(zlib.compress(s)) == s.  However, that is only a small fraction of our test cases.

Even if round-trips were the only place hypothesis is useful (I don't think they are), some of these round-trips are among the trickiest and most important code to test, even if they're a small fraction of the test suite. We have a bunch of functions that are basically "load this file format" and "dump this file format", usually implemented in C, which are a magnet for CVEs and a frequent fuzzing target for exactly that reason. Having a small library of maintained tests for round-tripping file formats seems like it would be very useful for people who want to donate compute time to fuzz-test CPython (or other implementations!).
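
The quoted zlib round-trip, written out with pinned edge cases (a sketch; the specific @example inputs are just plausible guesses):

```python
import zlib

from hypothesis import example, given
from hypothesis import strategies as st


@given(st.binary())
@example(b"")                # empty input
@example(b"\x00" * 100_000)  # large, highly compressible input
def test_zlib_roundtrip(s):
    assert zlib.decompress(zlib.compress(s)) == s
```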

> Speed is another issue.  During development, it doesn't matter much if Hypothesis takes a lot of time exercising one function.  But in the standard library, tests already run slowly enough to impact development.  If hypothesis were to run every time we run the test suite, it would make the situation worse.

As mentioned in the initial ticket, the current plan I'm suggesting is to have fallback stubs which turn your property tests into parameterized tests when hypothesis is not installed. If we're good about adding `@example` decorators (and doing so is certainly easier than writing new ad hoc tests for every edge case we can think of when we already have property tests written!), then I don't see any particular reason to do a full hypothesis run of the test suite on every CI run.
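
A minimal sketch of what such a stub could look like (a hypothetical implementation, not settled design): without hypothesis installed, @given replays only the inputs pinned by @example:

```python
import functools

try:
    from hypothesis import example, given
except ImportError:
    def example(*args, **kwargs):
        def accumulate(test):
            # Stash the pinned inputs on the test function itself.
            test._examples = getattr(test, "_examples", [])
            test._examples.append((args, kwargs))
            return test
        return accumulate

    def given(*_strategies, **_kw_strategies):
        def wrap(test):
            @functools.wraps(test)
            def run_pinned_examples(*outer_args):
                # No example generation: just replay each @example.
                for ex_args, ex_kwargs in getattr(test, "_examples", []):
                    test(*outer_args, *ex_args, **ex_kwargs)
            return run_pinned_examples
        return wrap
```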

My suggestion is:

1. By default, run hypothesis in "stubs" mode, where the property tests are simply parameterized tests.
2. Have one or two CI jobs that run *only* the hypothesis tests, generating new examples — since this is just for edge case detection, it doesn't necessarily need to run on every combination of architecture, platform and configuration in our testing matrix, just the ones where it could plausibly make a difference.
3. Ask users who are adding new hypothesis tests or who find new bugs to add @example decorators when they fix a failing test case.

With the stubs, these tests won't be any slower than other tests, and I think a single "hypothesis-only" CI job run in parallel with the other jobs won't break the compute bank. Notably, we can almost certainly disable coverage measurement on the edge-case-only job as well, since I believe it adds quite a bit of overhead.

> There is also a learning curve issue.  We're adding more and more things that newcomers have to learn before they can contribute (how to run blurb, how to use the argument clinic, etc).  Using Hypothesis is a learned skill and takes time.

Again, I don't think hypothesis tests would be *mandatory*, or that we plan to replace our existing testing framework with hypothesis tests. The median newcomer won't need to know anything about hypothesis any more than they need to know the details of the grammar or the ceval loop to make a meaningful contribution. I don't even expect property tests to make up a large fraction of our tests: I'm a pretty big fan of hypothesis and used it extensively for zoneinfo, and yet only 13% of the tests (by line count) are property tests:

$ wc -l tests/test_zoneinfo*
  325 tests/test_zoneinfo_property.py
 2138 tests/test_zoneinfo.py
 2463 total

I will not deny that there *are* costs to bringing this into the standard library, but I think they can largely be mitigated. The main concern is really ensuring that we can bring it on board without making our CI system more fragile and without shifting much (if any) of the burden of fixing bitrot in the hypothesis tests onto the buildbot maintainers and release managers.