Message 383384 - Python tracker


Author	esoma
Recipients	esoma
Date	2020-12-19.15:46:27
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1608392787.93.0.861284496438.issue42687@roundup.psfhosted.org>
In-reply-to

Content

The tokenize module does not recognize '<>' as a single token; instead, it splits it into two tokens.

```
$ python -c "import tokenize; import io; import pprint; pprint.pprint(list(tokenize.tokenize(io.BytesIO(b'<>').readline)))"
[TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
 TokenInfo(type=54 (OP), string='<', start=(1, 0), end=(1, 1), line='<>'),
 TokenInfo(type=54 (OP), string='>', start=(1, 1), end=(1, 2), line='<>'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 2), end=(1, 3), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
```


I would expect:
```
[TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
 TokenInfo(type=54 (OP), string='<>', start=(1, 0), end=(1, 2), line='<>'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 2), end=(1, 3), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
```

This is the behavior of the CPython tokenizer, which the tokenize module tries "to match the working of".
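
For anyone who wants to check this on their own interpreter, here is a minimal sketch of the same reproduction as a script rather than a one-liner. The helper name `operator_tokens` is made up for illustration and is not part of the tokenize module. The result is expected to depend on the Python version, so the script only reports which of the two cases it sees.

```python
import io
import tokenize


def operator_tokens(source):
    """Return the strings of all OP tokens that tokenize yields for `source` (bytes)."""
    readline = io.BytesIO(source).readline
    return [
        tok.string
        for tok in tokenize.tokenize(readline)
        if tok.type == tokenize.OP
    ]


ops = operator_tokens(b"<>")
if len(ops) == 1:
    print("'<>' was tokenized as a single OP token:", ops)
else:
    print("'<>' was split into separate OP tokens:", ops)
```

Filtering on `tokenize.OP` instead of the numeric type (54 in the output above) keeps the check independent of how token numbers are assigned in a given release.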

History
Date	User	Action	Args
2020-12-19 15:46:27	esoma	set	recipients: + esoma
2020-12-19 15:46:27	esoma	set	messageid: <1608392787.93.0.861284496438.issue42687@roundup.psfhosted.org>
2020-12-19 15:46:27	esoma	link	issue42687 messages
2020-12-19 15:46:27	esoma	create