Message141503
The tokenize module is happy to tokenize Python source code that the real tokenizer would reject. Pretty much any instance where tokenizer.c returns ERRORTOKEN will illustrate this feature. Here are some examples:
Python 3.3.0a0 (default:2d69900c0820, Aug 1 2011, 13:46:51)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tokenize import generate_tokens
>>> from io import StringIO
>>> def tokens(s):
...     """Return a string showing the tokens in the string s."""
...     return '|'.join(t[1] for t in generate_tokens(StringIO(s).readline))
...
>>> # Bad exponent
>>> print(tokens('1if 2else 3'))
1|if|2|else|3|
>>> 1if 2else 3
  File "<stdin>", line 1
    1if 2else 3
    ^
SyntaxError: invalid token
>>> # Bad hexadecimal constant.
>>> print(tokens('0xfg'))
0xf|g|
>>> 0xfg
  File "<stdin>", line 1
    0xfg
    ^
SyntaxError: invalid syntax
>>> # Missing newline after continuation character.
>>> print(tokens('\\pass'))
\|pass|
>>> \pass
  File "<stdin>", line 1
    \pass
    ^
SyntaxError: unexpected character after line continuation character
It is surprising that the tokenize module does not yield the same tokens as Python itself, but since this limitation only affects incorrect Python code, perhaps it just needs a mention in the tokenize documentation. Something along the lines of: "The tokenize module generates the same tokens as Python's own tokenizer if it is given correct Python code. However, it may incorrectly tokenize Python code containing syntax errors that the real tokenizer would reject."
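The discrepancy can be seen directly by running the same source through both the tokenize module and the byte-compiler (whose first stage is the real tokenizer). This is a minimal sketch, not part of any API — the helper names `tokenizes_ok` and `compiles_ok` are mine, and the broad exception handling hedges against later Python versions, where tokenize may have grown stricter than it was when this report was filed:

```python
import io
import tokenize

def tokenizes_ok(source):
    # True if the tokenize module consumes the source without raising.
    try:
        list(tokenize.generate_tokens(io.StringIO(source).readline))
        return True
    except (tokenize.TokenError, SyntaxError, IndentationError):
        return False

def compiles_ok(source):
    # True if Python's own tokenizer/parser accepts the source.
    try:
        compile(source, "<test>", "exec")
        return True
    except SyntaxError:
        return False

for src in ("1if 2else 3\n", "0xfg\n", "a = 1\n"):
    print(f"{src!r:<16} tokenize: {tokenizes_ok(src)!s:<5} compile: {compiles_ok(src)}")
```

On the interpreter shown in the transcript, the first two sources tokenize successfully but fail to compile, while `a = 1` passes both; the exact split may vary by Python version.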
History
Date                 User                Action  Args
2011-08-01 12:58:39  gdr@garethrees.org  set     recipients: + gdr@garethrees.org
2011-08-01 12:58:39  gdr@garethrees.org  set     messageid: <1312203519.76.0.918580877349.issue12675@psf.upfronthosting.co.za>
2011-08-01 12:58:39  gdr@garethrees.org  link    issue12675 messages
2011-08-01 12:58:38  gdr@garethrees.org  create