Message 393620 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Zac Hatfield-Dodds
Recipients	Zac Hatfield-Dodds, gvanrossum, p-ganssle, terry.reedy
Date	2021-05-14.04:21:45
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1620966107.54.0.339435983349.issue42109@roundup.psfhosted.org>
In-reply-to

Content
(I think structuring this as a high-level explanation is clearer than point-by-point replies - it covers most of them, but the logic is hopefully easier to follow) The core idea of Hypothesis is that making random choices is equivalent to parsing a random bitstring: - wrap your test function in a decorator which calls it many times with random inputs - describe those inputs with strategies, (which are parser combinators) - the bytes parsed by strategies can be provided by a Random() instance, saved to and loaded from a database, edited and (maybe) re-parsed into related or simpler inputs, etc. Minithesis [1] is a complete implementation of this core, including strategies, shrinking, and the database, in a few hundred lines of Python. I think reading this will answer most of your implementation questions. Hypothesis is not the only property-based testing library for Python (there's pytest-quickcheck), but in the same sense that `six` is not the only py2/py3 compatibility layer. I attribute this to a combination of a better underlying model [2], and brute implementation effort to provide great error messages, integration with popular libraries, lots of strategies, related tooling, and so on. [1] https://github.com/DRMacIver/minithesis See also https://hypothesis.works/articles/how-hypothesis-works/ for an older writeup; the implementation details are a little outdated but it explains the motivations better. [2] https://hypothesis.works/articles/types-and-properties/ So a "strategy" is an object (implementation detail: a parser combinator, instance of SearchStrategy), which tells Hypothesis how to convert a random bytestring into whatever object you've asked for. This is a little unusual, but I haven't found a better way which still makes it easy to mix the strategy primitives with arbitrary user code such as "pick a username, ensure it's in the database, and then pass it to my test method". I share your distaste for "Haskell in Python"; I don't think Hypothesis introduces more of that flavor than Numpy would introduce Fortran idioms - there's occasionally a resemblance, but they're both Python-first libraries and I'd attribute it mostly to the problem being solved. FWIW Hypothesis deliberately rejected most of the design choices of QuickCheck, and implementation-wise has more in common with unixy fuzzing tools like AFL. The strategies in our API can generally be divided into useful primitives (e.g. integers, floats, lists, tuples, sampled_from, one_of) and pre-composed strategies (booleans = sampled_from, dictionaries = map+lists-of-kv-tuples, builds = tuples+fixed_dictionaries+map). Very specific strategies like emails() are included because they're a common need, and a naive implementation usually omits exactly the weirder and error-prone features that might reveal bugs. We _are_ deliberately vague about the order in which we generate test data, because the distribution is full of heuristics and adapts itself to the behaviour of the test - including over multiple runs, thanks to the database. The exact details are also considered an implementation detail, and subject to change as we discover heuristics and sampling techniques which find bugs faster. The upside here is writing effective strategies is straightforward: - expressing more possible (valid) inputs is good, because e.g. if the bug triggers on NaN we'd have to generate some NaNs to detect it; and - limiting use of rejection sampling (hypothesis.assume(), the .filter() method) is an obvious performance improvement - if only 20% of inputs we try to generate are valid, the test will take five times longer. following which Hypothesis' backend (the "conjecture" engine) uses a large helping of heuristics and fancy code to ensure that we get a nice diverse sample from the inputs space. Unfortunately there's a direct tension between high performance and explainable behaviour; Minithesis is readable mostly by virtue of not handling the hard edge cases. I'll take my Hypothesmith [3] project for random Python programs as an example of a complex strategy. It was inspired by CSmith [4] - but is considerably less sophisticated. CSmith constructs semantically valid C programs; Hypothesmith (for now!) constructs a subset of syntatically-valid Python. The basic idea is to "invert" the grammar: start with the root node, recursively construct a valid parse tree, and then serialise it out to a parsable input string. Doing a better job would require more time than I had for a proof-of-concept, but no new insights; and the same applies to defining a strategy for semantically-valid code. Being based on the grammar I'd guess it's unlikely to find corner cases in the grammar, but it might do so in the parser, or via e.g. `(tree := ast.parse(code)) == ast.parse(ast.unparse(tree))`. Your distrust of the python_programs() strategy is entirely reasonable, and I'd also want to understand it. I'd just add that it doesn't need to be elegant, or general, or express all inputs... so long as it can find valid inputs that reveal a bug, more easily or cheaply than users or other testing techniques :-) To test compiler optimizations, yes, once you had the strategy that exact test body would work. An even more effective option is to use "equivalence modulo inputs" [5] to derive `func2` and also assert that `optimizer(func)(arg) == optimizer(func2)(arg)`. [3] https://pypi.org/project/hypothesmith/ ; in action at https://github.com/psf/black/blob/main/fuzz.py [4] https://embed.cs.utah.edu/csmith/ [5] http://vuminhle.com/pdf/pldi14-emi.pdf I hope that helps; in particular Minithesis should take most of the magic out of it.

(I think structuring this as a high-level explanation is clearer than point-by-point replies - it covers most of them, but the logic is hopefully easier to follow)

The core idea of Hypothesis is that making random choices is equivalent to parsing a random bitstring:
- wrap your test function in a decorator which calls it many times with random inputs
- describe those inputs with strategies, (which are parser combinators)
- the bytes parsed by strategies can be provided by a Random() instance, saved to and loaded from a database, edited and (maybe) re-parsed into related or simpler inputs, etc.
Minithesis [1] is a complete implementation of this core, including strategies, shrinking, and the database, in a few hundred lines of Python. I think reading this will answer most of your implementation questions.

Hypothesis is not the *only* property-based testing library for Python (there's pytest-quickcheck), but in the same sense that `six` is not the only py2/py3 compatibility layer. I attribute this to a combination of a better underlying model [2], and brute implementation effort to provide great error messages, integration with popular libraries, lots of strategies, related tooling, and so on.

[1] https://github.com/DRMacIver/minithesis
See also https://hypothesis.works/articles/how-hypothesis-works/ for an older writeup; the implementation details are a little outdated but it explains the motivations better.
[2] https://hypothesis.works/articles/types-and-properties/

So a "strategy" is an object (implementation detail: a parser combinator, instance of SearchStrategy), which tells Hypothesis how to convert a random bytestring into whatever object you've asked for. This is a little unusual, but I haven't found a better way which still makes it easy to mix the strategy primitives with arbitrary user code such as "pick a username, ensure it's in the database, and then pass it to my test method".

I share your distaste for "Haskell in Python"; I don't think Hypothesis introduces more of that flavor than Numpy would introduce Fortran idioms - there's occasionally a resemblance, but they're both Python-first libraries and I'd attribute it mostly to the problem being solved. FWIW Hypothesis deliberately rejected most of the design choices of QuickCheck, and implementation-wise has more in common with unixy fuzzing tools like AFL.

The strategies in our API can generally be divided into useful primitives (e.g. integers, floats, lists, tuples, sampled_from, one_of) and pre-composed strategies (booleans = sampled_from, dictionaries = map+lists-of-kv-tuples, builds = tuples+fixed_dictionaries+map).
Very specific strategies like emails() are included because they're a common need, and a naive implementation usually omits exactly the weirder and error-prone features that might reveal bugs.

We _are_ deliberately vague about the order in which we generate test data, because the distribution is full of heuristics and adapts itself to the behaviour of the test - including over multiple runs, thanks to the database. The exact details are also considered an implementation detail, and subject to change as we discover heuristics and sampling techniques which find bugs faster.

The upside here is writing effective strategies is straightforward:
- expressing more possible (valid) inputs is good, because e.g. if the bug triggers on NaN we'd have to generate some NaNs to detect it; and
- limiting use of rejection sampling (hypothesis.assume(), the .filter() method) is an obvious performance improvement - if only 20% of inputs we try to generate are valid, the test will take five times longer.
following which Hypothesis' backend (the "conjecture" engine) uses a large helping of heuristics and fancy code to ensure that we get a nice diverse sample from the inputs space. Unfortunately there's a direct tension between high performance and explainable behaviour; Minithesis is readable mostly by virtue of not handling the hard edge cases.

I'll take my Hypothesmith [3] project for random Python programs as an example of a complex strategy. It was inspired by CSmith [4] - but is considerably less sophisticated. CSmith constructs semantically valid C programs; Hypothesmith (for now!) constructs a subset of syntatically-valid Python. The basic idea is to "invert" the grammar: start with the root node, recursively construct a valid parse tree, and then serialise it out to a parsable input string. Doing a better job would require more time than I had for a proof-of-concept, but no new insights; and the same applies to defining a strategy for semantically-valid code. Being based on the grammar I'd guess it's unlikely to find corner cases in the grammar, but it might do so in the parser, or via e.g. `(tree := ast.parse(code)) == ast.parse(ast.unparse(tree))`.

Your distrust of the python_programs() strategy is entirely reasonable, and I'd also want to understand it. I'd just add that it doesn't need to be elegant, or general, or express all inputs... so long as it can find valid inputs that reveal a bug, more easily or cheaply than users or other testing techniques :-)

To test compiler optimizations, yes, once you had the strategy that exact test body would work. An even more effective option is to use "equivalence modulo inputs" [5] to derive `func2` and also assert that `optimizer(func)(arg) == optimizer(func2)(arg)`.

[3] https://pypi.org/project/hypothesmith/ ; in action at https://github.com/psf/black/blob/main/fuzz.py
[4] https://embed.cs.utah.edu/csmith/
[5] http://vuminhle.com/pdf/pldi14-emi.pdf

I hope that helps; in particular Minithesis should take most of the magic out of it.

History
Date	User	Action	Args
2021-05-14 04:21:47	Zac Hatfield-Dodds	set	recipients: + Zac Hatfield-Dodds, gvanrossum, terry.reedy, p-ganssle
2021-05-14 04:21:47	Zac Hatfield-Dodds	set	messageid: <1620966107.54.0.339435983349.issue42109@roundup.psfhosted.org>
2021-05-14 04:21:47	Zac Hatfield-Dodds	link	issue42109 messages
2021-05-14 04:21:45	Zac Hatfield-Dodds	create