This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: tokenize module happily tokenizes code with syntax errors
Type: behavior Stage: test needed
Components: Library (Lib) Versions: Python 3.11
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, daniel.urban, eric.snow, ezio.melotti, flox, gdr@garethrees.org, iritkatriel, meador.inge, r.david.murray, sandro.tosi, terry.reedy, vladris
Priority: normal
Keywords:

Created on 2011-08-01 12:58 by gdr@garethrees.org, last changed 2022-04-11 14:57 by admin.

Messages (12)
msg141503 - Author: Gareth Rees (gdr@garethrees.org) (Python triager) Date: 2011-08-01 12:58
The tokenize module is happy to tokenize Python source code that the real tokenizer would reject. Pretty much any instance where tokenizer.c returns ERRORTOKEN will illustrate this feature. Here are some examples:

    Python 3.3.0a0 (default:2d69900c0820, Aug  1 2011, 13:46:51) 
    [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from tokenize import generate_tokens
    >>> from io import StringIO
    >>> def tokens(s):
    ...    """Return a string showing the tokens in the string s."""
    ...    return '|'.join(t[1] for t in generate_tokens(StringIO(s).readline))
    ...
    >>> # Bad exponent
    >>> print(tokens('1if 2else 3'))
    1|if|2|else|3|
    >>> 1if 2else 3
      File "<stdin>", line 1
        1if 2else 3
             ^
    SyntaxError: invalid token
    >>> # Bad hexadecimal constant.
    >>> print(tokens('0xfg'))
    0xf|g|
    >>> 0xfg
      File "<stdin>", line 1
        0xfg
           ^
    SyntaxError: invalid syntax
    >>> # Missing newline after continuation character.
    >>> print(tokens('\\pass'))
    \|pass|
    >>> \pass 
      File "<stdin>", line 1
        \pass
            ^
    SyntaxError: unexpected character after line continuation character

It is surprising that the tokenize module does not yield the same tokens as Python itself, but as this limitation only affects incorrect Python code, perhaps it just needs a mention in the tokenize documentation. Something along the lines of, "The tokenize module generates the same tokens as Python's own tokenizer if it is given correct Python code. However, it may incorrectly tokenize Python code containing syntax errors that the real tokenizer would reject."
msg141507 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2011-08-01 13:28
I'm not familiar with the parser internals (I'm nosying someone who is), but I suspect that what you are seeing at the command line is the errors being caught at a later stage than the tokenizer.
msg141513 - Author: Gareth Rees (gdr@garethrees.org) (Python triager) Date: 2011-08-01 13:40
These errors are generated directly by the tokenizer. In tokenizer.c, the tokenizer generates ERRORTOKEN when it encounters something it can't tokenize. This causes parsetok() in parsetok.c to stop tokenizing and return an error.
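For reference, the same condition is visible from the Python side as a token whose type is ERRORTOKEN. A minimal sketch (my own, not part of the message above) that scans a stream for it:

    # Sketch: find the first ERRORTOKEN that the pure-Python tokenizer emits.
    # Unlike tokenizer.c, tokenize.py keeps going after the error unless the
    # caller stops consuming the generator.
    from io import StringIO
    from tokenize import ERRORTOKEN, generate_tokens

    def first_error_token(source):
        """Return the first ERRORTOKEN in source, or None if there is none."""
        for tok in generate_tokens(StringIO(source).readline):
            if tok.type == ERRORTOKEN:
                return tok
        return None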
msg141521 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2011-08-01 14:59
This should probably be fixed (patches welcome). However, note that even with valid Python code, the tokens are not the same.
msg141543 - Author: Vlad Riscutia (vladris) Date: 2011-08-02 02:45
How come the tokenize module is not based on the actual C tokenizer? Wouldn't that make more sense (and prevent this kind of issue)?
msg141545 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2011-08-02 04:42
tokenize has useful features that the built-in tokenizer does not possess, such as the NL token.
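For example (a rough illustration of my own, assuming current tokenize behaviour), blank lines and comment-only lines end in NL tokens, which the C tokenizer never passes on to the parser:

    from io import StringIO
    from tokenize import generate_tokens, tok_name

    source = "x = 1\n\n# just a comment\ny = 2\n"
    for tok in generate_tokens(StringIO(source).readline):
        print(tok_name[tok.type], repr(tok.string))
    # The blank line and the comment-only line each produce an NL token.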
msg141624 - Author: Gareth Rees (gdr@garethrees.org) (Python triager) Date: 2011-08-04 11:54
I'm having a look to see if I can make tokenize.py better match the real tokenizer, but I need some feedback on a couple of design decisions. 

First, how to handle tokenization errors? There are three possibilities:

1. Generate an ERRORTOKEN, resynchronize, and continue to tokenize from after the error. This is what tokenize.py currently does in the two cases where it detects an error.

2. Generate an ERRORTOKEN and stop tokenizing. This is what tokenizer.c does.

3. Raise an exception (IndentationError, SyntaxError, or TabError). This is what the user sees when the parser is invoked from pythonrun.c.

Since the documentation for tokenize.py says, "It is designed to match the working of the Python tokenizer exactly", I think that implementing option (2) is best here. (This will mean changing the behaviour of tokenize.py in the two cases where it currently detects an error, so that it stops tokenizing.)
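As a rough sketch of what option (2) could mean for callers (hypothetical helper, not an existing API): a thin wrapper that yields tokens until the first ERRORTOKEN and then stops, the way parsetok() does with tokenizer.c.

    # Hypothetical wrapper illustrating option (2).
    from tokenize import ERRORTOKEN, generate_tokens

    def generate_tokens_strict(readline):
        for tok in generate_tokens(readline):
            yield tok
            if tok.type == ERRORTOKEN:
                return  # stop at the first error, like tokenizer.c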

Second, how to record the cause of the error? The real tokenizer records the cause of the error in the 'done' field of the 'tok_state' structure, but tokenize.py loses this information. I propose to add fields to the TokenInfo structure (which is a namedtuple) to record this information. The real tokenizer uses numeric constants from errcode.h (E_TOODEEP, E_TABSPACE, E_DEDENT, etc.), and pythonrun.c converts these to English-language error messages (E_TOODEEP: "too many levels of indentation"). Both of these pieces of information will be useful, so I propose to add two fields: "error" (containing a string like "TOODEEP") and "errormessage" (containing the English-language error message).
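Spelled out (illustration only; the field names are the ones proposed above, the rest is guesswork), the widened structure would be something like the following, although the next message explains why simply widening the tuple is not workable:

    # Illustrative only -- TokenInfo widened with the two proposed fields.
    from collections import namedtuple

    TokenInfo = namedtuple(
        'TokenInfo', 'type string start end line error errormessage')

    # 'error' would hold a constant name such as 'TOODEEP', and
    # 'errormessage' the corresponding text reported by pythonrun.c,
    # e.g. "too many levels of indentation".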
msg141625 - Author: Gareth Rees (gdr@garethrees.org) (Python triager) Date: 2011-08-04 12:21
Having looked at some of the consumers of the tokenize module, I don't think my proposed solutions will work.

It seems to be the case that the resynchronization behaviour of tokenize.py is important for consumers that are using it to transform arbitrary Python source code (like 2to3.py). These consumers are relying on the "roundtrip" property that X == untokenize(tokenize(X)). So solution (1) is necessary for the handling of tokenization errors.
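To make the round-trip property concrete (my example, reusing an input from the original report; on current CPython this prints True):

    from io import StringIO
    from tokenize import generate_tokens, untokenize

    source = "0xfg\n"   # rejected by the real tokenizer, tokenized happily here
    print(untokenize(generate_tokens(StringIO(source).readline)) == source)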

Also, the fact that TokenInfo is a 5-tuple is relied on in some places (e.g. lib2to3/patcomp.py line 38), so it can't be extended. And there are consumers (though none in the standard library) that rely on type=ERRORTOKEN being the way to detect errors in a tokenization stream. So I can't overload that field of the structure.

Any good ideas for how to record the cause of error without breaking backwards compatibility?
msg141626 - Author: Gareth Rees (gdr@garethrees.org) (Python triager) Date: 2011-08-04 12:26
Ah ... TokenInfo is a *subclass* of namedtuple, so I can add extra properties to it without breaking consumers that expect it to be a 5-tuple.
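For instance (just a sketch of the idea, not a patch), the extra information can live outside the tuple so that 5-tuple unpacking keeps working:

    # Sketch: extra, non-tuple attributes on a TokenInfo-style subclass.
    # Code that unpacks  type, string, start, end, line = tok  is unaffected.
    from collections import namedtuple
    from tokenize import ERRORTOKEN

    class TokenInfo(namedtuple('TokenInfo', 'type string start end line')):
        error = None          # e.g. 'TOODEEP' (class-level default: no error)
        errormessage = None   # e.g. "too many levels of indentation"

    tok = TokenInfo(ERRORTOKEN, ' ', (1, 1), (1, 2), '1 ')
    tok.error = 'TOODEEP'     # arbitrary illustrative value, set per instance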
msg141680 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2011-08-05 19:53
I have not used tokenize, but if it is *not* intended to exactly reproduce the internal tokenizer's behavior, the claim that it is should be amended.
msg141690 - Author: Gareth Rees (gdr@garethrees.org) (Python triager) Date: 2011-08-05 21:11
Terry: agreed. Does anyone actually use this module? Does anyone know what the design goals are for tokenize? If someone can tell me, I'll do my best to make it meet them.

Meanwhile, here's another bug. Each character of trailing whitespace is tokenized as an ERRORTOKEN.

    Python 3.3.0a0 (default:c099ba0a278e, Aug  2 2011, 12:35:03) 
    [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from tokenize import tokenize,untokenize
    >>> from io import BytesIO
    >>> list(tokenize(BytesIO('1 '.encode('utf8')).readline))
    [TokenInfo(type=57 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''), TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 '), TokenInfo(type=54 (ERRORTOKEN), string=' ', start=(1, 1), end=(1, 2), line='1 '), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
msg404672 - Author: Irit Katriel (iritkatriel) (Python committer) Date: 2021-10-21 21:42
Reproduced on 3.11.
History
Date                 User                Action  Args
2022-04-11 14:57:20  admin               set     github: 56884
2021-10-21 21:42:42  iritkatriel         set     nosy: + iritkatriel; messages: + msg404672; versions: + Python 3.11, - Python 2.7, Python 3.3, Python 3.4
2014-02-08 20:15:10  terry.reedy         set     versions: + Python 3.4, - Python 3.2
2011-10-22 19:52:56  flox                set     nosy: + flox
2011-10-22 19:15:57  meador.inge         set     nosy: + meador.inge
2011-08-05 21:25:12  sandro.tosi         set     nosy: + sandro.tosi
2011-08-05 21:11:50  gdr@garethrees.org  set     messages: + msg141690
2011-08-05 19:53:34  terry.reedy         set     nosy: + terry.reedy; messages: + msg141680
2011-08-05 19:33:28  daniel.urban        set     nosy: + daniel.urban
2011-08-04 12:26:06  gdr@garethrees.org  set     messages: + msg141626
2011-08-04 12:21:05  gdr@garethrees.org  set     messages: + msg141625
2011-08-04 11:54:04  gdr@garethrees.org  set     messages: + msg141624
2011-08-02 04:42:17  benjamin.peterson   set     messages: + msg141545
2011-08-02 02:45:32  vladris             set     nosy: + vladris; messages: + msg141543
2011-08-01 16:30:17  eric.snow           set     nosy: + eric.snow
2011-08-01 15:11:46  ezio.melotti        set     stage: test needed
2011-08-01 15:11:32  ezio.melotti        set     nosy: + ezio.melotti; versions: + Python 2.7, Python 3.2
2011-08-01 14:59:02  benjamin.peterson   set     messages: + msg141521
2011-08-01 13:40:41  gdr@garethrees.org  set     messages: + msg141513
2011-08-01 13:28:08  r.david.murray      set     nosy: + r.david.murray, benjamin.peterson; messages: + msg141507
2011-08-01 12:58:39  gdr@garethrees.org  create