Title: Untokenize and retokenize does not round-trip
Components: Library (Lib) Versions: Python 3.8
Nosy List: Zac Hatfield-Dodds, meador.inge, terry.reedy
Created on 2019-12-02 11:06 by Zac Hatfield-Dodds

Author: Zac Hatfield-Dodds (Zac Hatfield-Dodds) Date: 2019-12-02 11:06
I've been working on a tool called Hypothesmith to generate arbitrary Python source code, inspired by CSmith's success in finding C compiler bugs.  It's based on the grammar but ultimately only generates strings which `compile` accepts; this is the only way I know to answer the question "is the string valid Python"!

I should be clear that I don't think the minimal examples are representative of real problems that users may encounter!  However, fuzzing is very effective at finding important bugs if we can get these apparently-trivial ones out of the way by changing either the code or the test :-)

import io
import tokenize

def test_tokenize_round_trip_string(source_code):
    # Tokenize the original source, keeping full position information.
    tokens = list(tokenize.generate_tokens(io.StringIO(source_code).readline))
    outstring = tokenize.untokenize(tokens)  # may have changed whitespace from source
    # Re-tokenize the regenerated source and compare token types and strings only.
    output = tokenize.generate_tokens(io.StringIO(outstring).readline)
    assert [(t.type, t.string) for t in tokens] == [(t.type, t.string) for t in output]

Each of the `@example` cases is accepted by `compile` but fails the test; the `@given` case describes how to generate more such strings.  You can read more details in the Hypothesmith repo if interested.
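
For anyone who wants to check a single string by hand without Hypothesis, here is a minimal sketch of the same property as a plain function.  The helper names `token_pairs` and `round_trips` are hypothetical (they are not part of `tokenize`), and the input shown is just an ordinary statement, not one of the failing cases.

```python
import io
import tokenize


def token_pairs(source):
    """Return the (type, string) pair of every token in `source`."""
    readline = io.StringIO(source).readline
    return [(t.type, t.string) for t in tokenize.generate_tokens(readline)]


def round_trips(source):
    """Hypothetical helper: True if tokenize -> untokenize -> tokenize
    yields the same (type, string) sequence as tokenizing `source` directly."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    rebuilt = tokenize.untokenize(tokens)  # whitespace may differ from `source`
    return token_pairs(source) == token_pairs(rebuilt)


# An ordinary statement; swap in any string that `compile` accepts.
print(round_trips("x = 1\n"))
```

Comparing only `(type, string)` pairs, as in the test above, deliberately ignores token positions, so a whitespace-only difference in the regenerated source does not count as a failure.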

I think these are real and probably unimportant bugs, but I'd love to start a conversation about what properties should *always* hold for functions dealing with Python source code - and how best to report research results if I can demonstrate that they don't!

(for example, lib2to3 has many similar failures but I don't want to open a long list of low-value issues)
