Importing tokenize modifies token #69511
Importing tokenize modifies the content of the token module:

    >>> import token
    >>> len(token.tok_name)
    59
    >>> import tokenize
    >>> len(token.tok_name)
    61

Such a side effect looks like bad design to me. Either token.tok_name should contain the elements needed by tokenize, or tokenize should not modify token.tok_name.
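The mechanism behind the growing dict can be sketched with stand-in data (the names and numbers below are illustrative, not the exact CPython values): tokenize of that era defined its extra token types and wrote their names straight into the mapping it shared with token.

```python
# Stand-in for token.tok_name; the real dict is much larger.
tok_name = {0: 'ENDMARKER', 1: 'NAME', 2: 'NUMBER'}
N_TOKENS = 3

# What importing tokenize effectively did: define extra token types
# and write their names back into the shared mapping.
COMMENT = N_TOKENS
tok_name[COMMENT] = 'COMMENT'
NL = N_TOKENS + 1
tok_name[NL] = 'NL'

print(len(tok_name))  # the shared dict grew from 3 to 5
```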
I agree :-) Do you want to work on a patch?
I'll leave the decision to the module maintainers.
I would fix this by making tokenize.tok_name a copy. It looks like this behaviour dates back to 1997 (see revision 1efc4273fdb7).
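The copy-based fix can be sketched the same way, again with stand-in data: tokenize would extend a private copy and leave the shared mapping alone.

```python
shared = {0: 'ENDMARKER', 1: 'NAME'}  # plays the role of token.tok_name
tok_name = dict(shared)               # tokenize-local copy
tok_name[2] = 'COMMENT'               # extend only the copy

print(len(shared), len(tok_name))  # the shared dict stays at 2, the copy is 3
```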
token.tok_name is part of the Python public API, whereas tokenize.tok_name isn't documented. So I dislike having two disconnected mappings. I prefer to add the tokenize tokens directly in Lib/token.py, and then get COMMENT, NL and ENCODING using tok_name.index().
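Since tok_name is a dict keyed by the numeric token value, the "index" step would amount to a reverse lookup. A sketch with a stand-in mapping (the numeric values here are made up):

```python
tok_name = {0: 'ENDMARKER', 57: 'COMMENT', 58: 'NL'}  # stand-in mapping

# Recover the numeric value from the name, as tokenize could do if the
# names were defined in token.py.
COMMENT = next(value for value, name in tok_name.items() if name == 'COMMENT')
print(COMMENT)  # 57
```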
Extract of Lib/lib2to3/pgen2/driver.py:

    if type in (tokenize.COMMENT, tokenize.NL):
        ...
        if debug:
            self.logger.debug("%s %r (prefix=%r)",
                              token.tok_name[type], value, prefix)

The code uses tokenize.COMMENT and looks this constant up in token.tok_name. If token.tok_name doesn't contain COMMENT anymore, it breaks lib2to3, no? At least the debug mode, which might not be covered by test_lib2to3.
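How that breakage would show up, with stand-in values: if tokenize.COMMENT kept its numeric value but the name no longer appeared in token.tok_name, the debug lookup above would raise KeyError.

```python
tok_name = {0: 'ENDMARKER'}  # a tok_name without an entry for COMMENT
COMMENT = 54                 # illustrative tokenize-side constant

try:
    print(tok_name[COMMENT])
except KeyError:
    print('KeyError: COMMENT missing from tok_name')
```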
Another example from Tools/i18n/pygettext.py, TokenEater:

    def __call__(self, ttype, tstring, stup, etup, line):
        # dispatch
        ## import token
        ## print >> sys.stderr, 'ttype:', token.tok_name[ttype], \
        ##     'tstring:', tstring
        self.__state(ttype, tstring, stup[0])

    ...

    eater.set_filename(filename)
    try:
        tokens = tokenize.tokenize(fp.readline)
        for _token in tokens:
            eater(*_token)

Another example using token.tok_name with token types coming from tokenize.
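A minimal, runnable version of that pattern (the stdlib module names are real; the source string is made up): feed tokenize.tokenize a readline callable and look each token type up in token.tok_name. On current Python this works because COMMENT, NL and ENCODING ended up in the token module.

```python
import io
import token
import tokenize

src = b"x = 1  # a comment\n"
for tok in tokenize.tokenize(io.BytesIO(src).readline):
    # Token types come from tokenize, names are looked up in token.tok_name.
    print(token.tok_name[tok.type], tok.string)
```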
That would make more sense from a breaking change perspective, but we would step on the toes of anyone adding […]. It would also make much more sense to have them as fields on […]
lib2to3 appears to have its own token.py as well, with NL and COMMENT but without ENCODING...
Oh, you are right: Lib/lib2to3/pgen2/driver.py uses Lib/lib2to3/pgen2/token.py.
I've updated the PR and added the tokenize tokens to token.h and their names to tokenizer.c. This way they'll show up when you run token.py. The names will always be in tok_name, and tokenize.py will use those. This doesn't break the public API and no longer modifies the token module when you import tokenize.
Albert-Jan's patch should be merged prior to my patch for bpo-30455, because it can be backported, but my patch shouldn't be. But first we should compare them and find the differences in their results. My patch adds COMMENT, NL and ENCODING to the documentation; Albert-Jan's patch doesn't. I don't know whether it is worth adding these names to the token module in bugfix releases, or whether it is worth documenting them. My patch wraps all names in _PyParser_TokenNames after <ERRORTOKEN> in angle brackets; Albert-Jan's patch inserts the new names between <ERRORTOKEN> and <N_TOKENS> without angle brackets. I don't know which is more correct.
I suggest only modifying the master branch. Nobody complained before; it's only a very minor surprising behaviour.
Let me know if you want me to add/change anything about my PR :) I'm happy to do so.
If we merge this only in master, the new members of the token module should be documented (add a versionchanged directive in token.rst). This also needs a Misc/NEWS entry.
I've updated token.rst and Misc/NEWS. Let me know if the wording is correct. |
If we move COMMENT, NL and ENCODING to the token module, their documentation should be moved from tokenize.rst to token.rst. I have asked on Python-Dev whether this is the right way. [1]

[1] https://mail.python.org/pipermail/python-dev/2017-May/148080.html
We got a bug report from Coverity:

    *** CID 1411801: Incorrect expression (MISSING_COMMA)
    /Parser/tokenizer.c: 111 in ()
    105         "OP",
    106         "AWAIT",
    107         "ASYNC",
    108         "<ERRORTOKEN>",
    109         "COMMENT",
    110         "NL",
    >>> CID 1411801: Incorrect expression (MISSING_COMMA)
    >>> In the initialization of "_PyParser_TokenNames", a suspicious concatenated string ""ENCODING<N_TOKENS>"" is produced.
    111         "ENCODING"
    112         "<N_TOKENS>"
    113 };
    114
    115
    116 /* Create and initialize a new tok_state structure */

I missed this typo :-p
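The same pitfall exists in Python: adjacent string literals concatenate silently, so a missing comma fuses two list elements instead of raising an error. A small demonstration mirroring the Coverity finding:

```python
names = [
    "COMMENT",
    "NL",
    "ENCODING"      # <- missing comma, as in the C array above
    "<N_TOKENS>",
]
print(len(names), repr(names[-1]))  # 3 'ENCODING<N_TOKENS>'
```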
Aah! Oops, I can fix it later today.
New changeset c9ccace by Mariatta (Albert-Jan Nijburg) in branch 'master':
Don't worry. Our test suite and all reviewers also missed it ;-)