Issue 25324: Importing tokenize modifies token

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/69511

classification

Title:	Importing tokenize modifies token
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Albert-Jan Nijburg, Mariatta, georg.brandl, josephgordon, martin.panter, meador.inge, serhiy.storchaka, vstinner
Priority:	normal	Keywords:

Created on 2015-10-06 15:36 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 1608	merged	Albert-Jan Nijburg, 2017-05-16 15:44
PR 1910	merged	Albert-Jan Nijburg, 2017-06-01 20:02
PR 1911	merged	serhiy.storchaka, 2017-06-01 22:33

Messages (24)
msg252397 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-10-06 15:36
Importing the tokenize modifies the content of the token module. >>> import token >>> len(token.tok_name) 59 >>> import tokenize >>> len(token.tok_name) 61 Such side effect looks as bad design to me. Either token.tok_name should contain elements needed for tokenize, or tokenize should not modify token.tok_name.
msg252398 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-10-06 15:39
I agree :-) Do you want to work on a patch?
msg252407 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-10-06 16:39
I left the decision for modules maintainers.
msg252439 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-10-06 23:30
I would fix this by making tokenize.tok_name a copy. It looks like this behaviour dates back to 1997 (see revision 1efc4273fdb7).
msg293786 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-05-16 21:02
> I would fix this by making tokenize.tok_name a copy. It looks like this behaviour dates back to 1997 (see revision 1efc4273fdb7). token.tok_name is part of the Python public API: https://docs.python.org/dev/library/token.html#token.tok_name whereas tokenize.tok_name isn't documented. So I dislike having two disconnected mappings. I prefer to add tokenize tokens directly in Lib/token.py, and then get COMMENT, NL and ENCODING using tok_name.index().
msg293787 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-05-16 21:06
Extract of Lib/lib2to3/pgen2/driver.py: if type in (tokenize.COMMENT, tokenize.NL): ... if debug: self.logger.debug("%s %r (prefix=%r)", token.tok_name[type], value, prefix) The code uses tokenize.COMMENT and look for this constant into token.tok_name. If token.tok_name doesn't contain COMMENT anymore, it breaks lib2to3, no? At least the debug mode which might not be covered by test_lib2to3.
msg293789 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-05-16 21:09
Another example from Tools/i18n/pygettext.py, TokenEater:: def __call__(self, ttype, tstring, stup, etup, line): # dispatch ## import token ## print >> sys.stderr, 'ttype:', token.tok_name[ttype], \ ## 'tstring:', tstring self.__state(ttype, tstring, stup[0]) ... eater.set_filename(filename) try: tokens = tokenize.tokenize(fp.readline) for _token in tokens: eater(*_token) Another example using token.tok_name with token types coming from tokenize.
msg293793 - (view)	Author: Albert-Jan Nijburg (Albert-Jan Nijburg) *	Date: 2017-05-16 21:24
> I prefer to add tokenize tokens directly in Lib/token.py, and then get COMMENT, NL and ENCODING using tok_name.index(). That would make more sense from a breaking change perspective, but we would step on the toes of anyone adding `COMMENT`, `NL`, or `ENCODING` to `token.h` because `token.py` is generated from that. It would also make much more sense to have them as fields on `token` if they are in `tok_name` in `token`.
msg293796 - (view)	Author: Albert-Jan Nijburg (Albert-Jan Nijburg) *	Date: 2017-05-16 21:29
lib2to3 appears to have it's own token.py as well with NL and COMMENT withtout ENCODING... Lib/lib2to3/pgen2/token.py
msg293799 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-05-16 21:34
> lib2to3 appears to have it's own token.py as well with NL and COMMENT withtout ENCODING... Oh you are right: Lib/lib2to3/pgen2/driver.py uses Lib/lib2to3/pgen2/token.py.
msg294345 - (view)	Author: Albert-Jan Nijburg (Albert-Jan Nijburg) *	Date: 2017-05-24 11:16
I've updated the PR and added the tokenize tokens to token.h and their names to tokenizer.c. This way they'll show up when you run token.py. The names will always be in tok_name and tokenizer.py will use those. Not breaking the public api and no longer modifying token.py when you import tokenizer.py.
msg294751 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-05-30 11:54
https://github.com/python/cpython/pull/1608 doesn't include a Misc/NEWS entry, but I don't want to stress albertjan with Misc/NEWS since I was already annoying enough on the review :-) I'm not sure that the change deserves a NEWS entry. If you consider that yes, it does. It can be done in a separated change.
msg294755 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-05-30 13:32
Albert-Jan's patch should be merged prior my patch for issue30455, because it can be backported, by my patch shouldn't. But first we should compare them and find differences in results. My patch adds COMMENT, NL and ENCODING in the documentation, Albert-Jan's patch doesn't. I don't know whether it is worth to add these names to the token module in bugfix releases, and whether it is worth to document them. My patch wraps all names in _PyParser_TokenNames after <ERRORTOKEN> with angle parenthesis, Albert-Jan's patch inserts new names between <ERRORTOKEN> and <N_TOKENS> without angle parenthesis. I don't know what is more correct.
msg294761 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-05-30 14:26
I suggest to only modify the master branch. Nobody complained before. It's only a very minor surprising behaviour.
msg294774 - (view)	Author: Albert-Jan Nijburg (Albert-Jan Nijburg) *	Date: 2017-05-30 17:49
Let me know if you want me to add/change anything about my PR :) I'm happy to do so.
msg294785 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-05-30 21:12
If merge this only in master, the new members of the token module should be documented (add a versionchanged directive in token.rst). This also needs a Misc/NEWS entry.
msg294823 - (view)	Author: Albert-Jan Nijburg (Albert-Jan Nijburg) *	Date: 2017-05-31 08:12
I've updated token.rst and Misc/NEWS. Let me know if the wording is correct.
msg294832 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-05-31 11:04
If move COMMENT, NL and ENCODING to the token module, their documentation should be moved from tokenize.rst to token.rst. I have asked on Python-Dev whether this is a right way. [1] [1] https://mail.python.org/pipermail/python-dev/2017-May/148080.html
msg294842 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-05-31 14:00
New changeset fc354f07855a9197e71f851ad930cbf5652f9160 by Victor Stinner (Albert-Jan Nijburg) in branch 'master': bpo-25324: copy tok_name before changing it (#1608) https://github.com/python/cpython/commit/fc354f07855a9197e71f851ad930cbf5652f9160
msg294957 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-06-01 17:08
We got a bug report from Coverity: *** CID 1411801: Incorrect expression (MISSING_COMMA) /Parser/tokenizer.c: 111 in () 105 "OP", 106 "AWAIT", 107 "ASYNC", 108 "<ERRORTOKEN>", 109 "COMMENT", 110 "NL", >>> CID 1411801: Incorrect expression (MISSING_COMMA) >>> In the initialization of "_PyParser_TokenNames", a suspicious concatenated string ""ENCODING<N_TOKENS>"" is produced. 111 "ENCODING" 112 "<N_TOKENS>" 113 }; 114 115 116 /* Create and initialize a new tok_state structure */ I missed this typo :-p
msg294958 - (view)	Author: Albert-Jan Nijburg (Albert-Jan Nijburg) *	Date: 2017-06-01 17:20
Aah! Oops I can fix later today. On Thu, 1 Jun 2017 at 18:08, STINNER Victor <report@bugs.python.org> wrote: > > STINNER Victor added the comment: > > We got a bug report from Coverity: > > *** CID 1411801: Incorrect expression (MISSING_COMMA) > /Parser/tokenizer.c: 111 in () > 105 "OP", > 106 "AWAIT", > 107 "ASYNC", > 108 "<ERRORTOKEN>", > 109 "COMMENT", > 110 "NL", > >>> CID 1411801: Incorrect expression (MISSING_COMMA) > >>> In the initialization of "_PyParser_TokenNames", a suspicious > concatenated string ""ENCODING<N_TOKENS>"" is produced. > 111 "ENCODING" > 112 "<N_TOKENS>" > 113 }; > 114 > 115 > 116 /* Create and initialize a new tok_state structure */ > > I missed this typo :-p > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue25324> > _______________________________________ >
msg294966 - (view)	Author: Mariatta (Mariatta) *	Date: 2017-06-01 20:51
New changeset c9ccacea3ff441b1ff519c7399602b7db16f9783 by Mariatta (Albert-Jan Nijburg) in branch 'master': bpo-25324: add missing comma in Parser/tokenizer.c (GH-1910) https://github.com/python/cpython/commit/c9ccacea3ff441b1ff519c7399602b7db16f9783
msg294970 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-06-01 21:12
Albert-Jan Nijburg added the comment: > Aah! Oops I can fix later today. Don't worry. Our test suite and all reviewers also missed it ;-)
msg295267 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-06-06 15:43
New changeset 5cefb6cfdd089d237ba6724bb5311ee4f04be59f by Serhiy Storchaka in branch 'master': bpo-25324: Move the description of tokenize tokens to token.rst. (#1911) https://github.com/python/cpython/commit/5cefb6cfdd089d237ba6724bb5311ee4f04be59f

History
Date	User	Action	Args
2022-04-11 14:58:22	admin	set	github: 69511
2017-06-06 15:54:21	serhiy.storchaka	set	status: open -> closed stage: resolved resolution: fixed versions: + Python 3.7, - Python 2.7, Python 3.4, Python 3.5, Python 3.6
2017-06-06 15:43:38	serhiy.storchaka	set	messages: + msg295267
2017-06-01 22:33:44	serhiy.storchaka	set	pull_requests: + pull_request1991
2017-06-01 21:12:33	vstinner	set	messages: + msg294970
2017-06-01 20:51:29	Mariatta	set	nosy: + Mariatta messages: + msg294966
2017-06-01 20:02:37	Albert-Jan Nijburg	set	pull_requests: + pull_request1990
2017-06-01 17:20:21	Albert-Jan Nijburg	set	messages: + msg294958
2017-06-01 17:08:45	vstinner	set	messages: + msg294957
2017-05-31 14:00:23	vstinner	set	messages: + msg294842
2017-05-31 11:04:10	serhiy.storchaka	set	messages: + msg294832
2017-05-31 08:12:14	Albert-Jan Nijburg	set	messages: + msg294823
2017-05-30 21:12:20	serhiy.storchaka	set	messages: + msg294785
2017-05-30 17:49:55	Albert-Jan Nijburg	set	messages: + msg294774
2017-05-30 14:26:58	vstinner	set	messages: + msg294761
2017-05-30 13:32:25	serhiy.storchaka	set	messages: + msg294755
2017-05-30 11:54:45	vstinner	set	messages: + msg294751
2017-05-24 11:16:28	Albert-Jan Nijburg	set	messages: + msg294345
2017-05-16 21:34:09	vstinner	set	messages: + msg293799
2017-05-16 21:29:09	Albert-Jan Nijburg	set	messages: + msg293796
2017-05-16 21:24:31	Albert-Jan Nijburg	set	nosy: + Albert-Jan Nijburg messages: + msg293793
2017-05-16 21:09:26	vstinner	set	messages: + msg293789
2017-05-16 21:06:23	vstinner	set	messages: + msg293787
2017-05-16 21:02:54	vstinner	set	messages: + msg293786
2017-05-16 15:44:07	Albert-Jan Nijburg	set	pull_requests: + pull_request1699
2015-12-22 08:41:28	josephgordon	set	nosy: + josephgordon
2015-10-06 23:30:46	martin.panter	set	nosy: + martin.panter messages: + msg252439
2015-10-06 16:39:18	serhiy.storchaka	set	messages: + msg252407
2015-10-06 15:39:00	vstinner	set	nosy: + vstinner messages: + msg252398
2015-10-06 15:36:38	serhiy.storchaka	create