Add new attribute to TokenInfo to report specific token IDs #46387
Comments
The function generate_tokens in tokenize.py yields the generic token OP (51) for a colon. I'm attaching a minor sample that demonstrates this; running it produces output like:

1 'if' (1, 0) (1, 2) if a == 2:

I didn't check whether other tokens have the same problem; this is just what I noticed.
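For context, a minimal sketch that reproduces the reported behaviour, written with the Python 3 API (the numeric value of OP varies between versions):

import io
import tokenize

# Both '==' and ':' in "if a == 2:" are reported as the generic OP token.
source = "if a == 2:\n    pass\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tok.type, tokenize.tok_name[tok.type], repr(tok.string))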
I'm attaching a patch that fixes this and updates the tests.
Unfortunately, I think this will break many users of tokenize.py, e.g. http://browsershots.googlecode.com/svn/trunk/devtools/pep8/pep8.py. If tokenize now returns LPAR, that code will no longer work correctly.
I have not looked at this, but a new parameter would be a new feature. It's a moot point until there is an agreed-upon patch for a current version.
There are a *lot* of characters with semantic significance that are reported by the tokenize module as generic "OP" tokens: token.LPAR and the rest of the symbol types listed in the token module.

However, I can't fault tokenize for deciding to treat all of those tokens the same way - for many source code manipulation purposes, these just need to be transcribed literally, and the "OP" token serves that purpose just fine. As the extensive test updates in the current patch suggest, AMK is also correct that changing this away from always returning "OP" tokens (even for characters with more specialised tokens available) would be a backwards-incompatible change.

I think there are two parts to this problem, one documentation-related (affecting 2.7, 3.2, 3.3) and another that would be an actual change in 3.3:

1. Documentation: explicitly state that these symbols are reported as generic "OP" tokens.
2. Behaviour (3.3 only): provide easy access to the more specific token type without changing the existing tuple contents.
I believe that list includes all symbols and symbol combinations that are syntactically significant in expressions. This is the generalized meaning of 'operator' that is being used. What do not appear are '#', which marks comments, '_', which is a name character, and '\', which escapes characters within strings. Other symbols within strings will also not be marked as OP tokens. The non-syntactic symbols '$' and '?' are also omitted.
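A short sketch illustrating that distinction - symbols inside comments and string literals surface only as parts of COMMENT and STRING tokens, never as OP:

import io
import tokenize

# The '+' inside the string and the ':' inside the comment are not OP
# tokens; they are swallowed by the STRING and COMMENT tokens.
source = "x = 'a+b'  # y: z\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))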
Sure, but what does that have to do with anything? tokenize isn't a general-purpose tokenizer; it's specifically for tokenizing Python source code. The *problem* is that it doesn't currently fully tokenize everything, but doesn't explicitly say so in the module documentation. Hence my proposed two-fold fix: document the current behaviour explicitly, and also add a separate "exact_type" attribute for easy access to the detailed tokenization without doing your own string comparisons.
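A minimal sketch of how such an attribute would be used (exact_type did ship in Python 3.3 with this spelling):

import io
import tokenize

# .type stays OP for backwards compatibility, while .exact_type exposes
# the specific token constant (EQEQUAL, COLON, ...).
source = "if a == 2:\n    pass\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type == tokenize.OP:
        print(tokenize.tok_name[tok.type], "->",
              tokenize.tok_name[tok.exact_type], repr(tok.string))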
If you are responding to me, I am baffled. I gave a concise way to document the current behavior with respect to .OP, which you said you wanted.
Ah, I didn't read it as suggested documentation at all - you moved seamlessly from personal commentary to a docs suggestion without separating the two, so it appeared to be a complete non sequitur to me.

As for the docs suggestion, I think it works as the explanation of which tokens are affected once the concept of the token stream simplification is introduced:

The affected tokens are all symbols and symbol combinations that are syntactically significant in expressions (as listed in the token module). Anything which is not an independent token (i.e. '#', which marks comments, '_', which is just part of a name, '\', which is used for line continuations, the contents of string literals, and any symbols which are not a defined part of Python's syntax) is completely unaffected by this difference in behaviour.

If "exact_type" is introduced in 3.3, then the first paragraph can be adjusted accordingly.
Both the proposed text and the 3.3 addition look good to me.
The proposed documentation text seems too complicated and language-expert-speaky to me. We should try to link to standard definitions when possible to reduce the text here. For example, I believe the "Operators" and "Delimiters" tokens in the "Lexical Analysis" section of the docs (http://docs.python.org/dev/reference/lexical_analysis.html#operators) are exactly what we are trying to describe when referencing "literal tokens" and "affected tokens".

I like Nick's idea to introduce a new attribute for the exact type, while keeping the tuple structure itself backwards compatible. Attached is a patch for 3.3 that updates the docs, adds exact_type, adds new unit tests, and adds a new CLI option for displaying token names using the exact type. An example of the new CLI option is:

$ echo '1+2**4' | ./python -m tokenize
1,0-1,1: NUMBER '1'
1,1-1,2: OP '+'
1,2-1,3: NUMBER '2'
1,3-1,5: OP '**'
1,5-1,6: NUMBER '4'
1,6-1,7: NEWLINE '\n'
2,0-2,0: ENDMARKER ''
$ echo '1+2**4' | ./python -m tokenize -e
1,0-1,1: NUMBER '1'
1,1-1,2: PLUS '+'
1,2-1,3: NUMBER '2'
1,3-1,5: DOUBLESTAR '**'
1,5-1,6: NUMBER '4'
1,6-1,7: NEWLINE '\n'
2,0-2,0: ENDMARKER ''
Meador's patch looks good to me. The docs change for 2.7 and 3.2 would be similar, just with text like "Specific tokens can be distinguished by checking the string value of OP tokens."
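A minimal sketch of that string-comparison approach, written here in Python 3 syntax for brevity:

import io
import token
import tokenize

# Without exact_type, the specific operator is recovered by comparing
# the token's string by hand.
source = "if a == 2:\n    pass\n"
for tok_type, tok_string, start, end, line in tokenize.generate_tokens(
        io.StringIO(source).readline):
    if tok_type == token.OP and tok_string == ':':
        print("colon at", start)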
The cmdoption directive should be used with a program directive. See library/trace for an example of how to use it and to see the anchors and index entries it generates.
Ah, nice. Thanks for the tip, Éric. Updated patch attached, along with a patch for the 2.7/3.2 doc update.
New changeset 75baef657770 by Meador Inge in branch '2.7':
New changeset dfd74d752b0e by Meador Inge in branch '3.2':
New changeset f4976fa6e830 by Meador Inge in branch 'default':
Fixed. Thanks for the reviews, everyone.