classification
Title: tokenize module does not recognize Barry as FLUFL
Type: enhancement
Stage: resolved
Components:
Versions: Python 3.10
process
Status: closed
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: BTaskaya, esoma, terry.reedy
Priority: normal
Keywords: patch

Created on 2020-12-19 15:46 by esoma, last changed 2020-12-27 01:05 by esoma. This issue is now closed.

Pull Requests
URL Status Linked
PR 23857 closed esoma, 2020-12-19 16:01
Messages (3)
msg383384 - (view) Author: Erik Soma (esoma) * Date: 2020-12-19 15:46
The tokenize module does not recognize '<>' as a single token; instead, it splits it into two tokens.

```
$ python -c "import tokenize; import io; import pprint; pprint.pprint(list(tokenize.tokenize(io.BytesIO(b'<>').readline)))"
[TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
 TokenInfo(type=54 (OP), string='<', start=(1, 0), end=(1, 1), line='<>'),
 TokenInfo(type=54 (OP), string='>', start=(1, 1), end=(1, 2), line='<>'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 2), end=(1, 3), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
```


I would expect:
```
[TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
 TokenInfo(type=54 (OP), string='<>', start=(1, 0), end=(1, 2), line='<>'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 2), end=(1, 3), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
```

This is the behavior of the CPython tokenizer, which the tokenize module tries "to match the working of".
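
For anyone who wants to check this on their own interpreter, here is a minimal sketch that collects only the operator tokens. The helper name `op_tokens` is hypothetical and not part of the tokenize module. The report above was made against Python 3.10, where '<>' comes out as two OP tokens; other versions may group it differently, so the sketch does not assume either result.

```python
import io
import tokenize


def op_tokens(source: bytes) -> list[str]:
    """Return the string of every OP token that tokenize emits for source.

    Hypothetical helper for illustration only.
    """
    readline = io.BytesIO(source).readline
    return [
        tok.string
        for tok in tokenize.tokenize(readline)
        if tok.type == tokenize.OP
    ]


if __name__ == "__main__":
    # One entry means '<>' was kept whole; two entries means it was split.
    print(op_tokens(b"<>\n"))
    # '!=' is shown for comparison with the spelling that replaced '<>'.
    print(op_tokens(b"a != b\n"))
```

Comparing the two printed lists shows directly whether the tokenize module on a given build treats '<>' the same way it treats '!='.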
msg383787 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-12-26 00:56
I strongly disagree.  '<>' is not a legal operator any more.  It is a parse-time syntax error.  Whatever historical artifact is left in the CPython tokenizer, recognizing '<>' is not exposed to Python code.

>>> p = ast.parse('a <> b')
Traceback (most recent call last):
...
    a <> b
    ^
SyntaxError: invalid syntax  
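
The session above can be reproduced as a small self-contained check. This is only a sketch: the helper name `parses` is hypothetical, and it assumes `ast.parse` is called with its default compiler flags, as in the session above.

```python
import ast


def parses(source: str) -> bool:
    """Return True if ast.parse accepts source, False on a SyntaxError.

    Hypothetical helper for illustration only; uses default compiler flags.
    """
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    print(parses("a <> b"))  # the spelling discussed in this issue
    print(parses("a != b"))  # the current inequality operator
```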

When '<>' was legal, we may presume that the tokenize module recognized it, so that not recognizing it now was an intentional change.  Reverting this would be a disservice to users.

I think that the PR and this issue should be closed.  If the historical artifact bothers you, propose removing it instead of introducing a bug into the tokenize module.
msg383794 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2020-12-26 07:13
I concur with Terry.
History
Date User Action Args
2020-12-27 01:05:55  esoma  set  status: open -> closed; stage: patch review -> resolved
2020-12-26 07:13:41  BTaskaya  set  nosy: + BTaskaya; messages: + msg383794
2020-12-26 00:56:19  terry.reedy  set  versions: - Python 3.9; nosy: + terry.reedy; messages: + msg383787; type: enhancement
2020-12-19 16:01:57  esoma  set  keywords: + patch; stage: patch review; pull_requests: + pull_request22722
2020-12-19 15:46:27  esoma  create