classification
Title: Use hypothesis for testing the standard library, falling back to stubs
Type: enhancement Stage: patch review
Components: Tests Versions: Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: p-ganssle Nosy List: Guido.van.Rossum, Zac Hatfield-Dodds, brandtbucher, gvanrossum, p-ganssle, rhettinger, terry.reedy
Priority: normal Keywords: patch

Created on 2020-10-21 17:38 by p-ganssle, last changed 2021-05-27 17:42 by brandtbucher.

Pull Requests
URL Status Linked Edit
PR 22863 open p-ganssle, 2020-10-21 17:43
Messages (16)
msg379226 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2020-10-21 17:38
Following up on this year's language summit, I would like to propose that we start integrating property tests into the standard library. Zac Hatfield-Dodds (along with myself and Carl Friedrich Bolz-Tereick) have put together this repository of tests that we run against the standard library as is: https://github.com/Zac-HD/stdlib-property-tests

Here is the blog post covering the proposal from the language summit: https://pyfound.blogspot.com/2020/05/property-based-testing-for-python.html

The biggest challenges here are logistical:

1. Pulling in third party dependencies is difficult to do reliably on CI, but checking in hypothesis's entire dependency tree is probably not feasible.
2. We don't necessarily want to require developers running their tests locally to have to set up a hypothesis environment just to run the tests.
3. Hypothesis tests are not (by default) deterministic, which some are concerned could itself lead to flaky tests.


To allay these concerns, I propose that we implement a compatibility interface for hypothesis that uses the third party module when it's installed, but otherwise falls back to stubs. The way I see the stubs working is that `@given` *without* `@example`s would simply skip the test. If you specify `@given` and one or more `@example`s, the test falls back to a simple parameterized test when hypothesis is not available.
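The fallback stubs could look something like the following minimal sketch. All names and details here are assumptions about the eventual design for illustration, not the actual code from the linked PR:

```python
# Hypothetical sketch of the proposed fallback stubs. The decorator names
# mirror hypothesis' @given/@example; everything else is an assumption.
import functools
import unittest


def example(*args, **kwargs):
    """Record one explicit input on the test function."""
    def decorate(func):
        func._examples = getattr(func, "_examples", []) + [(args, kwargs)]
        return func
    return decorate


def given(*_strategies, **_kw_strategies):
    """Without hypothesis installed: ignore the strategies and either run
    the recorded @example inputs as a plain parameterized test, or skip
    the test entirely if no examples were given."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(self):
            examples = getattr(func, "_examples", [])
            if not examples:
                raise unittest.SkipTest("hypothesis not installed, no @example")
            for args, kwargs in examples:
                func(self, *args, **kwargs)
        return wrapper
    return decorate
```

Since decorators apply bottom-up, `@example` runs first and attaches its inputs to the raw test function, then `@given` wraps it, matching hypothesis' usual `@given(...)` on top, `@example(...)` below ordering.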

At least at first, we won't attempt to add a mandatory PR job that runs with hypothesis installed. Instead, I'd like to run either an optional job on PR or have one or more buildbots that run the hypothesis tests.

I would also like to suggest a policy of including at least one example in each property test, so that on PR at least some of the inputs are tested.
msg393616 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-05-13 22:15
I would like to have a more thorough discussion about the desirability of using Hypothesis first, since I feel that there is a rather hard "sell" going on. I brought this up in the SC tracker (https://github.com/python/steering-council/issues/65) but I don't want to have the discussion there.

Copying some quoted text from there:

[me]
> > Can we perhaps have an open discussion on the merits of Hypothesis itself on python-dev before committing to this?

[Paul]
> You mean hypothesis specifically or property-based testing in general? I think that if we add property-based testing, hypothesis is probably the only natural choice. Of course, I can make this case on the thread if that was your intention.

I don't even know what property-based testing means (for Python) other than Hypothesis. What other frameworks support that? In any case, nobody has been promoting anything else, but Zac has been on our case for over a year. :-)

[me]
> > It seems to promote a coding style that's more functional than is good for Python, and its inner workings seem too magical to me. Also the decorator-based DSL looks pretty inscrutable to me.

[Paul]
> Do you mean it's promoting a coding style within the tests, or in the public API? When designing zoneinfo I didn't think about how easy it would be to test with Hypothesis at all as part of the API and it was basically frictionless.

I meant in the public API. I don't doubt that for zoneinfo this worked well, but that doesn't prove it's any better than any other framework (maybe you would have been ecstatic if you could have used pytest as well :-). Fundamentally, I am a bit suspicious of Hypothesis' origins, Haskell, which is a language and community that just have a different approach to programming than Python. Don't get me wrong, I don't think there's anything wrong with Haskell (except that I've noticed it's particularly popular with the IQ >= 150 crowd :-). It's just different than Python, and just as we don't want to encourage writing "Fortran or Java in Python", I don't think it's a good idea to recommend writing "Haskell in Python".

> I think there are situations where it makes sense to write functional-style strategies because it can in some situations give hypothesis more information about how the strategies are transformed (and thus allow it to optimize how the data is drawn rather than drawing from the full set and discarding a bunch of stuff that doesn't match, or doing a bunch of extra logic on each draw), but it's by no means required. I'd say most of my "complex strategies" are decorated functions rather than chains of maps and filters.

Ah, here we get to the crux of the matter. What's a "strategy"? Maybe I would be more inclined to support Hypothesis if it was clearer what it does. I like libraries that do stuff for me that I know how to do myself. With Hypothesis, the deeper I get into the docs, the more I get the impression that there is deep magic going on that a poor sap like myself isn't supposed to understand. When I write @given(st.lists(st.integers())) what does that do? I haven't the foggiest idea, and from the docs I get the impression I'm not supposed to worry about it, and *that* worries me. Clearly it can't be generating all lists of 0 integers followed by all lists of 1 integer followed by all lists of 2 integers ... So in what order does it generate test data? The docs are pretty vague about that.

(I'm not playing dumb here. I really have no idea how to go about that, and it seems that this is at the heart of Hypothesis and its Haskell ancestors. Presumably Ph.D. theses went into making this sort of thing work. Perhaps more effort could be expended explaining this part to laypeople?)

> From the stdlib-property-tests repo, this seems like the most complicated strategy I'm seeing employed, and I find it fairly simple to understand, and I don't know how easy to read any code would be that generates trees of arbitrary depth containing objects with specific properties.

I have never tried that sort of thing, so I'll take your word for it.

But there are other strategies. It seems that a lot of the strategies are answers to a series of questions, along the lines of "how do I generate valid URLs"; and then "email addresses"; and then "IP addresses"; and then suddenly we find a strategy that generates *functions*, and then we get "recursive things", and it keeps going. Timezone keys?! No wonder you found it easy to use. :-)

I think I've heard that there's a strategy somewhere that generates random Python programs. How would that even work? Would it help me find corner cases in the grammar? I suppose we could actually use something like that for the "faster CPython" project, to validate that the optimizer doesn't break things. But how? Do I just write

@given(st.python_programs(), st.values())
def test_optimizer(func, arg):
    expected = func(arg)
    optimized = optimizer(func)(arg)
    assert optimized == expected

???

Methinks that the hard work then is writing the python_programs() strategy, so I'd want to understand that before trusting it.

I suspect I've repeated myself enough times at this point :-), so I hope I'll get some sort of answer. Somehow it seems that in its zeal to "sell" itself as super-easy to use in all sorts of circumstances, Hypothesis has buried the lede -- how do you write effective strategies?
msg393620 - (view) Author: Zac Hatfield-Dodds (Zac Hatfield-Dodds) * Date: 2021-05-14 04:21
(I think structuring this as a high-level explanation is clearer than point-by-point replies - it covers most of them, but the logic is hopefully easier to follow)


The core idea of Hypothesis is that making random choices is equivalent to parsing a random bitstring:
    - wrap your test function in a decorator which calls it many times with random inputs
    - describe those inputs with strategies (which are parser combinators)
    - the bytes parsed by strategies can be provided by a Random() instance, saved to and loaded from a database, edited and (maybe) re-parsed into related or simpler inputs, etc.
Minithesis [1] is a complete implementation of this core, including strategies, shrinking, and the database, in a few hundred lines of Python.  I think reading this will answer most of your implementation questions.
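A toy version of that core idea — not Minithesis itself, just a sketch of "generation is parsing a replayable choice sequence" — might look like this:

```python
# Toy illustration (assumed names, not Hypothesis' real API): a generator
# draws choices from a recorded prefix first, falling back to a PRNG.
# Re-parsing the recorded sequence reproduces the same generated value;
# editing/shortening that sequence is what shrinking operates on.
import random


class ChoiceSource:
    def __init__(self, prefix=(), seed=0):
        self.prefix = list(prefix)   # known choices to replay first
        self.rng = random.Random(seed)
        self.record = []             # every choice made, for later replay

    def draw_int(self, lo, hi):
        value = self.prefix.pop(0) if self.prefix else self.rng.randint(lo, hi)
        self.record.append(value)
        return value


def list_of_ints(source):
    """A tiny 'strategy': a parser from choices to a Python list."""
    out = []
    while source.draw_int(0, 3) != 0:        # ~75% chance of another element
        out.append(source.draw_int(-100, 100))
    return out


source = ChoiceSource(seed=42)
value = list_of_ints(source)
# Replaying the recorded choices re-parses them into the exact same value:
assert list_of_ints(ChoiceSource(prefix=source.record)) == value
```

The control flow of the "strategy" is driven entirely by the drawn values, which is why a saved choice sequence (e.g. from a failure database) deterministically reproduces the input that triggered a bug.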

Hypothesis is not the *only* property-based testing library for Python (there's pytest-quickcheck), but in the same sense that `six` is not the only py2/py3 compatibility layer.  I attribute this to a combination of a better underlying model [2], and brute implementation effort to provide great error messages, integration with popular libraries, lots of strategies, related tooling, and so on.

[1] https://github.com/DRMacIver/minithesis
    See also https://hypothesis.works/articles/how-hypothesis-works/ for an older writeup; the implementation details are a little outdated but it explains the motivations better.
[2] https://hypothesis.works/articles/types-and-properties/



So a "strategy" is an object (implementation detail: a parser combinator, instance of SearchStrategy), which tells Hypothesis how to convert a random bytestring into whatever object you've asked for.  This is a little unusual, but I haven't found a better way which still makes it easy to mix the strategy primitives with arbitrary user code such as "pick a username, ensure it's in the database, and then pass it to my test method".

I share your distaste for "Haskell in Python"; I don't think Hypothesis introduces more of that flavor than Numpy would introduce Fortran idioms - there's occasionally a resemblance, but they're both Python-first libraries and I'd attribute it mostly to the problem being solved.  FWIW Hypothesis deliberately rejected most of the design choices of QuickCheck, and implementation-wise has more in common with unixy fuzzing tools like AFL.

The strategies in our API can generally be divided into useful primitives (e.g. integers, floats, lists, tuples, sampled_from, one_of) and pre-composed strategies (booleans = sampled_from, dictionaries = map+lists-of-kv-tuples, builds = tuples+fixed_dictionaries+map).
Very specific strategies like emails() are included because they're a common need, and a naive implementation usually omits exactly the weirder and error-prone features that might reveal bugs.

We _are_ deliberately vague about the order in which we generate test data, because the distribution is full of heuristics and adapts itself to the behaviour of the test - including over multiple runs, thanks to the database.  The exact details are also considered an implementation detail, and subject to change as we discover heuristics and sampling techniques which find bugs faster.

The upside here is that writing effective strategies is straightforward:
    - expressing more possible (valid) inputs is good, because e.g. if the bug triggers on NaN we'd have to generate some NaNs to detect it; and
    - limiting use of rejection sampling (hypothesis.assume(), the .filter() method) is an obvious performance improvement - if only 20% of inputs we try to generate are valid, the test will take five times longer.
following which Hypothesis' backend (the "conjecture" engine) uses a large helping of heuristics and fancy code to ensure that we get a nice diverse sample from the inputs space.  Unfortunately there's a direct tension between high performance and explainable behaviour; Minithesis is readable mostly by virtue of not handling the hard edge cases.
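The rejection-sampling cost is easy to demonstrate in plain Python (this is an illustration of the principle, not hypothesis' API): both functions below produce even integers, but the filtering version throws away roughly half of its draws.

```python
import random


def even_by_filtering(rng, draws):
    """Rejection sampling: draw, then discard anything odd."""
    while True:
        draws.append(1)               # count every underlying draw
        n = rng.randint(-100, 100)
        if n % 2 == 0:
            return n


def even_by_construction(rng, draws):
    """Constructive: every draw is already valid."""
    draws.append(1)
    return 2 * rng.randint(-50, 50)


rng = random.Random(0)
filtered_draws, constructed_draws = [], []
for _ in range(1000):
    assert even_by_filtering(rng, filtered_draws) % 2 == 0
    assert even_by_construction(rng, constructed_draws) % 2 == 0

assert len(constructed_draws) == 1000   # one draw per value
assert len(filtered_draws) > 1500       # ~2000: about half are wasted
```

In hypothesis terms this is the difference between `.filter(lambda n: n % 2 == 0)` and `.map(lambda n: 2 * n)` on an integer strategy: the constructive form never wastes a draw.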



I'll take my Hypothesmith [3] project for random Python programs as an example of a complex strategy.  It was inspired by CSmith [4] - but is considerably less sophisticated.  CSmith constructs semantically valid C programs; Hypothesmith (for now!) constructs a subset of syntactically valid Python.  The basic idea is to "invert" the grammar: start with the root node, recursively construct a valid parse tree, and then serialise it out to a parsable input string.  Doing a better job would require more time than I had for a proof-of-concept, but no new insights; and the same applies to defining a strategy for semantically-valid code.  Being based on the grammar I'd guess it's unlikely to find corner cases in the grammar, but it might do so in the parser, or via e.g. `(tree := ast.parse(code)) == ast.parse(ast.unparse(tree))`.

Your distrust of the python_programs() strategy is entirely reasonable, and I'd also want to understand it.  I'd just add that it doesn't need to be elegant, or general, or express all inputs... so long as it can find valid inputs that reveal a bug, more easily or cheaply than users or other testing techniques :-)

To test compiler optimizations, yes, once you had the strategy that exact test body would work.  An even more effective option is to use "equivalence modulo inputs" [5] to derive `func2` and also assert that `optimizer(func)(arg) == optimizer(func2)(arg)`.

[3] https://pypi.org/project/hypothesmith/ ; in action at https://github.com/psf/black/blob/main/fuzz.py
[4] https://embed.cs.utah.edu/csmith/
[5] http://vuminhle.com/pdf/pldi14-emi.pdf



I hope that helps; in particular Minithesis should take most of the magic out of it.
msg393621 - (view) Author: Guido van Rossum (Guido.van.Rossum) Date: 2021-05-14 04:44
Thanks, I will look into that before stirring up more dust.
msg393633 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2021-05-14 08:45
This issue confuses two different testing issues.  First is selection of test data before the test versus generation while testing (usually 'randomly').  Second is how the function is tested.  All tests test some property of the inputs and function, but it is useful to distinguish between testing that a function produces a specific expected output versus testing something more abstract or defined in terms of the inputs.

Example 1: 2 + 3 = 5 (a + b = (known) c) versus 2 + 3 = 3 + 2 (a + b = b + a) and (2 + 3) + 4 = 2 + (3 + 4).

Example 2: sorted([3,5,1,4]) = [1,3,4,5] versus is_sorted(sorted([random list])).

The latter are what hypothesis people mean by 'property tests'.  Such property tests can be done with either fixed or random pairs in unittest module tests.  The test that a module compiles is a weak property test, but better than nothing.  A test that all lines of code run at least once without an unexpected exception is a stronger property test.  I consider it roughly on a par with min(s) <= mean(s) <= max(s).
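The sorted() example above can be written as an ordinary unittest property test with seeded random inputs, no hypothesis required (a sketch; the class and seed are illustrative):

```python
import random
import unittest
from collections import Counter


class SortedPropertyTest(unittest.TestCase):
    def test_sorted_properties(self):
        rng = random.Random(12345)   # fixed seed: deterministic, CI-friendly
        for _ in range(100):
            data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
            out = sorted(data)
            # Property 1: output is non-decreasing.
            self.assertTrue(all(a <= b for a, b in zip(out, out[1:])))
            # Property 2: output is a permutation of the input.
            self.assertEqual(Counter(out), Counter(data))
```

With the seed fixed this is a regression test; remove the seed (or drive the same body from hypothesis strategies) and it becomes a bug-finding test.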

The connection between randomized input testing, such as with hypothesis, and property tests is that with random inputs, one cannot test f(input) against an 'expected' value unless one has an alternate means of producing the expected value.  It is often easier to test a parameterized equation or other properties.

The problem with random input tests is not that they are 'flakey', but that they are useless unless someone is going to pay attention to failures and try to find the cause.  This touches on the difference between regression testing and bug-finding tests.  CPython CI is the former, and marred at that by buggy randomly failing tests.

My conclusion: bug testing would likely be a good idea, but should be done separate from the CI test suite.  Such testing should only be done for modules with an active maintainer who would welcome failure reports.
msg393661 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2021-05-14 14:13
@Terry

> The problem with random input tests is not that they are 'flakey', but that they are useless unless someone is going to pay attention to failures and try to find the cause.  This touches on the difference between regression testing and bug-finding tests.  CPython CI is the former, and marred at that by buggy randomly failing tests.

> My conclusion: bug testing would likely be a good idea, but should be done separate from the CI test suite.  Such testing should only be done for modules with an active maintainer who would welcome failure reports.

Are you saying that random input tests are flaky but that that is not the big problem? In my experience using hypothesis, in practice it is not the case that you get tests that fail randomly. The majority of the time if your code doesn't violate one of the properties, the tests fail the first time you run the test suite (this is particularly true for strategies where hypothesis deliberately makes it more likely that you'll get a "nasty" input by biasing the random selection algorithm in that direction). In a smaller number of cases, I see failures that happen on the second, third or fourth run.

That said, if it were a concern that every run of the tests is using different inputs (and thus you might see a bug that only appears once in every 20 runs), it is possible to run hypothesis in a configuration where you specify the seed, making it so that hypothesis always runs the same set of inputs for the same tests. We can disable that on a separate non-CI run for hypothesis "fuzzing" that would run the test suite for longer (or indefinitely) looking for long-tail violations of these properties.

I feel that if we don't at least run some form of the hypothesis tests in CI, there will likely be bit rot and the tests will decay in usefulness. Consider the case where someone accidentally breaks an edge case that makes it so that `json.loads(json.dumps(o))` no longer works for some obscure value of `o`. With hypothesis tests running in CI, we are MUCH more likely to find this bug / regression during the initial PR that would break the edge case than if we run it separately and report it later. If we run the hypothesis tests in a build-bot, the process would be:

1. Contributor makes PR with passing CI.
2. Core dev review passes, PR is merged.
3. Buildbot run occurs and the buildbot watch is notified.
4. Buildbot maintainers track down the PR responsible and either file a new bug or comment on the old bug.
5. Someone makes a NEW PR adding a regression test and the fix for the old PR.
6. Core dev review passes, second PR is merged.

If we run it in CI, the process would be:

1. Contributor makes PR, CI breaks.
2. If the contributor doesn't notice the broken CI, core dev points it out and it is fixed (or the PR is scrapped as unworkable).

Note that in the non-CI process, we need TWO core dev reviews, we need TWO PRs (people are not always super motivated to fix bugs that don't affect them that they caused when fixing a bug that does affect them), and we need time and effort from the buildbot maintainers (note the same applies even if the "buildbot" is actually a separate process run by Zac out of a github repo).

Even if the bug only appears in one out of every 4 CI runs, it's highly likely that it will be found and fixed before it makes it into production, or at least much more quickly, considering that most PRs go through a few edit cycles, and a good fraction of them are backported to 2-3 branches, all with separate CI runs. It's a much quicker feedback loop.

I think there's an argument to be made that incorporating more third-party libraries (in general) into our CI build might cause headaches, but I think that is not a problem specific to hypothesis, and I think it's one where we can find a reasonable balance that allows us to use hypothesis in one form or another in the standard library.
msg393922 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-05-19 04:45
Okay, well, I'm trying to understand minithesis.py, but I am despairing. The shrinking code (really everything in class TestingState) is too much to grok. You can tell from the amount of comments that this is the tour-de-force of the code, but the comments don't really help -- they seem to be mostly notes for people trying to reimplement this in another language. (Plus some snide remarks about QuickCheck.)

I suppose I have to read the paper at https://drmaciver.github.io/papers/reduction-via-generation-preview.pdf ... At least that has some drawings.

I am still unclear on some basics too, but presumably it's all related.

PS. What does "shortlex" mean?
msg393926 - (view) Author: Zac Hatfield-Dodds (Zac Hatfield-Dodds) * Date: 2021-05-19 07:04
> Okay, well, I'm trying to understand minithesis.py, but I am despairing. The shrinking code (really everything in class TestingState) is too much to grok. ... I suppose I have to read the paper at https://drmaciver.github.io/papers/reduction-via-generation-preview.pdf ... I am still unclear on some basics too, but presumably it's all related.

Shrinking is probably not important to understand, since it's *not* particularly related to anything but implementation details.  The abstract of that paper explains why the design is important - integrated shrinking means we can report minimal (easier-to-understand) failing examples with zero user effort, and guarantee that all shrunk examples could have been generated.

TestingState is, roughly, the `@given()` decorator internals: responsible for deciding how many times to run the wrapped test function, and how.  Note the three phases (generate, target, shrink) - the distribution of examples can depend on the behaviour of the test function on earlier inputs.  This is _more_ complex in Hypothesis and thus challenging to document.

The important concept is that input examples are generated by a Possibility (ie Strategy), which internally makes a sequence of choices; and that sequence is sourced from a maybe-empty known prefix (e.g. replaying failures, mutating previous inputs) with the suffix determined by a PRNG.


I feel like I'm explaining this badly, but there isn't a simple underlying thing to explain - Hypothesis works well because the user API is relatively simple, and then under the hood we have this many-layered stack of implementation details.  The synergies and coverage of otherwise-pathological cases this delivers is excellent, but "multiple years and PhDs worth of detailed synergistic work" is just intrinsically complicated.


> PS. What does "shortlex" mean?

Sort by length, and among equal-length strings sort in the normal (alphabetical) order.  For example: A < XY < ABC, etc.
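As a one-liner, shortlex ordering is just a compound sort key: length first, then the usual string comparison.

```python
# Shortlex: compare by length first, then alphabetically within a length.
words = ["ABC", "XY", "A", "AB"]
shortlex = sorted(words, key=lambda s: (len(s), s))
assert shortlex == ["A", "AB", "XY", "ABC"]
```

This is the order shrinkers prefer because a shortlex-smaller choice sequence is never longer, and hence (usually) never a more complicated example.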
msg393979 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-05-19 21:42
Few things incite me quite as much as being told that I'm not supposed to understand something, so now I'm actually making an effort to follow the ECOOP paper. Honestly, it's not too bad (a lot more readable than type theory papers :-). I'll get back to you once I'm more comfortable in mapping between that and minithesis.py.
msg393981 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2021-05-19 22:11
I do not want to dissuade you from figuring out how minithesis / hypothesis works (far from it), but I'm wondering if the question of how shrinking works is germane to the issue at hand, which is whether or not hypothesis / property-based-testing is suitable for testing the standard library.

I almost don't think it matters *if* shrinking works, much less *how*. Shrinking is something of a UI nicety in the sense that when hypothesis finds an example that violates a property, it will try to take whatever example it has chosen and find something like a "minimal working example" to show to the end user. So if we have something like:

```
from hypothesis import given, strategies

@given(x=strategies.integers())
def f(x):
    assert x >= 0
```

Hypothesis will presumably tell us that `x == -1` violates the property instead of `-24948929` or some other random thing. But even if it did report `-24948929`, it wouldn't be too hard to see "oh, right, integers can be negative". In some cases with the time zone stuff there isn't a good metric for complexity at the moment, so time zones are sorted by their IANA key (which is to say, basically arbitrarily), and I generally find the results useful anyway. (Admittedly, the shrinking still helps: most problems affecting all zones will return Africa/Abuja, whereas things particular to a specific zone's odd time zone history will return whichever zone with that quirk comes alphabetically first.)

Anyway, do not let me disrupt your process, I just thought it might be worth making the point that some of these specific details might be nice to know, but don't seem like they should be blockers for hypothesis' adoption in the standard library.
msg393986 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-05-19 23:21
It is important to me, because I don't like having a dependency on
third-party code that is a black box. Remember that in the end it's not me
you have to convince but the SC.
msg394143 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-05-21 19:18
Having read the parts of the paper that explain shrinking, things are making more sense now. I have also successfully contributed type annotations to minithesis! :-)
msg394279 - (view) Author: Guido van Rossum (Guido.van.Rossum) Date: 2021-05-25 00:44
I withdraw my objection against Hypothesis per se. Others should debate whether the 3rd party dependency is okay.

--Guido (mobile)
msg394282 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-05-25 01:21
Here's my two cents worth on the subject.  

* I use hypothesis during development, but don't have a need for it in the standard library.  By the time code lands there, we normally have a specific idea of what edge cases need to be in the tests.

* For the most part, hypothesis has not turned up anything useful for the standard library.  Most of the reports that we've gotten reflected a misunderstanding by the person running hypothesis rather than an actual bug.  For example, in colorsys, an issue was reported about the color conversions not round-tripping, but that is expected behavior for out-of-gamut conversions.  Also, the matrices used are dictated by standards and do not give exact inverses even for in-gamut conversions.

* For numeric code, hypothesis is difficult to use and requires many restrictions on the bounds of variables and error tolerances.  For example, when I have students write a function to evaluate the quadratic formula and test it with hypothesis, they struggle mightily (overflowing the discriminant, having 2*a close to zero, deciding whether the roots are sufficiently close, etc.)

* The main area where hypothesis seems easy to use and gives some comfort is in simple roundtrips:  assert zlib.decompress(zlib.compress(s)) == s.  However, that is only a small fraction of our test cases.
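That round-trip property, spelled out with a few hand-picked edge cases of the kind the proposed @example decorators would pin down (the sample inputs here are illustrative):

```python
import zlib


def check_roundtrip(data: bytes) -> None:
    """Property: decompression undoes compression for any input."""
    assert zlib.decompress(zlib.compress(data)) == data


# With hypothesis, `data` would come from a binary strategy; without it,
# these fixed samples still cover empty, highly repetitive, and
# full-byte-range inputs.
for sample in (b"", b"\x00" * 1024, bytes(range(256)) * 4, b"spam" * 10000):
    check_roundtrip(sample)
```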

* Speed is another issue.  During development, it doesn't matter much if Hypothesis takes a lot of time exercising one function.  But the standard library tests already run slowly enough to impact development.  If hypothesis were to run every time we run the test suite, it would make the situation worse.

* There is also a learning curve issue.  We're adding more and more things that newcomers have to learn before they can contribute (how to run blurb, how to use the argument clinic, etc).  Using Hypothesis is a learned skill and takes time.

TL;DR  I really like Hypothesis but think it doesn't belong in standard library tests.
msg394378 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2021-05-25 16:40
> I use hypothesis during development, but don't have a need for in the the standard library.  By the time code lands there, we normally have a specific idea of what edge cases needs to be in the tests.

The suggestion I've made here is that we use @example decorators to take the hypothesis tests you would have written already and turn them into what are essentially parameterized tests. For anyone who doesn't want to explicitly run the hypothesis test suite, the tests you are apparently already writing would simply turn into normal tests for just the edge cases.

One added benefit of keeping the tests around in the form of property tests is that you can run these same tests through hypothesis to find regressions in bugfixes that are implemented after landing (e.g. "Oh we can add a fast path here", which introduces a new edge case). The segfault bug from bpo-34454, for example, would have been found if I had been able to carry over the hypothesis-based tests I was using during the initial implementation of fromisoformat into later stages of the development. (Either because I didn't run it long enough to hit that particular edge case or because that edge case didn't come up until after I had moved the development locus into the CPython repo, I'm not sure).

Another benefit of keeping them around is that they become fuzz targets, meaning people like oss-fuzz or anyone who wants to throw some fuzzing resources at CPython have an existing body of tests that are expected to pass on *any* input, to find especially obscure bugs.

> For the most part, hypothesis has not turned up anything useful for the standard library.  Most of the reports that we've gotten reflected a misunderstanding by the person running hypothesis rather than an actual bug. [...]

I don't really think it's a good argument to say that it hasn't turned up useful bugs. Most of the bugs in a bit of code will be found during development or during the early stages of adoption, and we have very wide adoption. I've found a number of bugs in zoneinfo using hypothesis tests, and I'd love to continue using them in CPython rather than throwing them away or maintaining them in a separate repo.

I also think it is very useful for us to write tests about the properties of our systems for re-use in PyPy (which does use hypothesis, by the way) and other implementations of Python. This kind of "define the contract and maintain tests to enforce that" is very helpful for alternate implementations.

> For numeric code, hypothesis is difficult to use and requires many restrictions on the bounds of variables and error tolerances.  [...]

I do not think that we need to make hypothesis tests mandatory. They can be used when someone finds them useful.

> The main area where hypothesis seems easy to use and gives some comfort is in simple roundtrips:  assert zlib.decompress(zlib.compress(s)) == s.  However, that is only a small fraction of our test cases.

Even if this were the only time that hypothesis were useful (I don't think it is), some of these round-trips can be among the trickiest and important code to test, even if it's a small fraction of the tests. We have a bunch of functions that are basically "Load this file format" and "Dump this file format", usually implemented in C, which are a magnet for CVEs and often the target for fuzz testing for that reason. Having a small library of maintained tests for round tripping file formats seems like it would be very useful for people who want to donate compute time to fuzz test CPython (or other implementations!)

> Speed is another issue.  During development, it doesn't matter much if Hypothesis takes a lot of time exercising one function.  But in the standard library tests already run slow enough to impact development.  If hypothesis were to run everytime we run a test suite, it would make the situation worse.

As mentioned in the initial ticket, the current plan I'm suggesting is to have fallback stubs which turn your property tests into parameterized tests when hypothesis is not installed. If we're good about adding `@example` decorators (and certainly doing so is easier than writing new ad hoc tests for every edge case we can think of when we already have property tests written!), then I don't see any particular reason to run the full test suite against a full hypothesis run on every CI run.

My suggestion is:

1. By default, run hypothesis in "stubs" mode, where the property tests are simply parameterized tests.
2. Have one or two CI jobs that run *only* the hypothesis tests, generating new examples — since this is just for edge case detection, it doesn't necessarily need to run on every combination of architecture and platform and configuration in our testing matrix, just the ones where it could plausibly make a difference.
3. Ask users who are adding new hypothesis tests or who find new bugs to add @example decorators when they fix a failing test case.

With the stubs, these tests won't be any slower than other tests, and I think a single "hypothesis-only" CI job done in parallel with the other jobs won't break the compute bank. Notably, we can almost certainly disable coverage detection on the edge-case-only job as well, which I believe adds quite a bit of overhead.

> There is also a learning curve issue.  We're adding more and more things that newcomer's have to learn before they can contribute (how to run blurb, how to use the argument clinic, etc).  Using Hypothesis is a learned skill and takes time.

Again I don't think hypothesis tests are *mandatory* or that we plan to replace our existing testing framework with hypothesis tests. The median newcomer won't need to know anything about hypothesis testing any more than they need to know the details of the grammar or of the ceval loop to make a meaningful contribution. I don't even expect property tests to make up a large fraction of our test suite. I'm a pretty big fan of hypothesis and used it extensively for zoneinfo, and yet only 13% of the tests (by line count) are property tests:

$ wc -l tests/test_zoneinfo*
  325 tests/test_zoneinfo_property.py
 2138 tests/test_zoneinfo.py
 2463 total

I will not deny that there *are* costs to bringing this into the standard library, but I think they can largely be mitigated. The main concern is really ensuring that we can bring it on board without making our CI system more fragile and without shifting much (if any) concern about fixing bitrot associated with the hypothesis tests onto the buildbot maintainers and release managers.
msg394415 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2021-05-26 05:36
In order to understand what Paul is concretely proposing, I read and partly reviewed his PR.  I thought about maintainability.

Part A adds test.support.hypothesis_helper, similar in purpose to other specialized xyz_helper modules.  Hypothesis_helper imports either hypothesis itself or a replacement 'as hypothesis'.  The replacement is currently called '_hypothesis_stub', but its name is irrelevant as it is normally not seen by the user.

Part B adds a second zoneinfo test module, one that uses hypothesis. Besides further testing zoneinfo, it serves to test hypothesis_helper and partly justify its addition.  It starts with 'from hypothesis_helper import hypothesis'.  The new test module has tests such as

    @hypothesis.given(
        dt=hypothesis.strategies.one_of(
            hypothesis.strategies.datetimes(), hypothesis.strategies.times()
        )
    )
    @hypothesis.example(dt=datetime.datetime.min)
    @hypothesis.example(dt=datetime.datetime.max)
    @hypothesis.example(dt=datetime.datetime(1970, 1, 1))
    @hypothesis.example(dt=datetime.datetime(2039, 1, 1))
    @hypothesis.example(dt=datetime.time(0))
    @hypothesis.example(dt=datetime.time(12, 0))
    @hypothesis.example(dt=datetime.time(23, 59, 59, 999999))
    def test_utc(self, dt):
        zi = self.klass("UTC")
        dt_zi = dt.replace(tzinfo=zi)

        self.assertEqual(dt_zi.utcoffset(), ZERO)
        self.assertEqual(dt_zi.dst(), ZERO)
        self.assertEqual(dt_zi.tzname(), "UTC")

@given always (Paul?) runs the examples as subtests.  When the replacement is imported, no randomized examples are added.

If some year an example were to fail, could a hypothesis-ignorant, zoneinfo-knowing person deal with the failure?  I believe so, knowing that the above is equivalent to a test with:

    for dt in dt_list:
        with self.subTest(...):
            <body of function>

Why not require such a rewriting?  Some reasons against: rewriting by hand can lead to errors; test_utc would have to be split into test_utc_ran(dom) and test_utc_pre(set); code would have to be duplicated unless factored into a third function; and 'for' plus 'with' together add two levels of indentation, which sometimes squeezes asserts onto more lines.  I believe that when hypothesis is present, there are advantages to including preset examples alongside given-generated examples.

Paul would like the PR to be a 'camel's nose' in the CPython tent.  This cuts both ways.  I think this PR should be judged on its own merits.  Opening possibilities can be a merit as long as not seen as pre-approval.  Our CI testing and buildbots are already configured to blame and annoy the wrong people for random failures.  I don't want more unless failure notices are sent to someone responsible for the failing code.
History
Date                 User                Action  Args
2021-05-27 17:42:02  brandtbucher        set     nosy: + brandtbucher
2021-05-26 05:36:26  terry.reedy         set     messages: + msg394415; versions: + Python 3.9
2021-05-25 16:40:03  p-ganssle           set     messages: + msg394378
2021-05-25 01:21:27  rhettinger          set     nosy: + rhettinger; messages: + msg394282
2021-05-25 00:44:26  Guido.van.Rossum    set     messages: + msg394279
2021-05-21 19:18:47  gvanrossum          set     messages: + msg394143
2021-05-19 23:21:13  gvanrossum          set     messages: + msg393986
2021-05-19 22:11:51  p-ganssle           set     messages: + msg393981
2021-05-19 21:42:47  gvanrossum          set     messages: + msg393979
2021-05-19 07:04:10  Zac Hatfield-Dodds  set     messages: + msg393926
2021-05-19 04:45:06  gvanrossum          set     messages: + msg393922
2021-05-14 14:13:41  p-ganssle           set     messages: + msg393661
2021-05-14 08:45:20  terry.reedy         set     messages: + msg393633
2021-05-14 04:44:16  Guido.van.Rossum    set     nosy: + Guido.van.Rossum; messages: + msg393621
2021-05-14 04:21:47  Zac Hatfield-Dodds  set     messages: + msg393620
2021-05-13 22:15:44  gvanrossum          set     nosy: + gvanrossum; messages: + msg393616
2020-10-21 17:43:43  p-ganssle           set     keywords: + patch; pull_requests: + pull_request21805
2020-10-21 17:38:33  p-ganssle           create