classification
Title: Importing tokenize modifies token
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Albert-Jan Nijburg, Mariatta, georg.brandl, haypo, josephgordon, martin.panter, meador.inge, serhiy.storchaka
Priority: normal Keywords:

Created on 2015-10-06 15:36 by serhiy.storchaka, last changed 2017-06-06 15:54 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 1608 merged Albert-Jan Nijburg, 2017-05-16 15:44
PR 1910 merged Albert-Jan Nijburg, 2017-06-01 20:02
PR 1911 merged serhiy.storchaka, 2017-06-01 22:33
Messages (24)
msg252397 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-10-06 15:36
Importing the tokenize modifies the content of the token module.

>>> import token
>>> len(token.tok_name)
59
>>> import tokenize
>>> len(token.tok_name)
61

Such side effect looks as bad design to me. Either token.tok_name should contain elements needed for tokenize, or tokenize should not modify token.tok_name.
msg252398 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2015-10-06 15:39
I agree :-) Do you want to work on a patch?
msg252407 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-10-06 16:39
I left the decision for modules maintainers.
msg252439 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-06 23:30
I would fix this by making tokenize.tok_name a copy. It looks like this behaviour dates back to 1997 (see revision 1efc4273fdb7).
msg293786 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-16 21:02
> I would fix this by making tokenize.tok_name a copy. It looks like this behaviour dates back to 1997 (see revision 1efc4273fdb7).

token.tok_name is part of the Python public API:
https://docs.python.org/dev/library/token.html#token.tok_name

whereas tokenize.tok_name isn't documented. So I dislike having two disconnected mappings. I prefer to add tokenize tokens directly in Lib/token.py, and then get COMMENT, NL and ENCODING using tok_name.index().
msg293787 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-16 21:06
Extract of Lib/lib2to3/pgen2/driver.py:

    if type in (tokenize.COMMENT, tokenize.NL):
        ...
    if debug:
        self.logger.debug("%s %r (prefix=%r)",
                          token.tok_name[type], value, prefix)

The code uses tokenize.COMMENT and look for this constant into token.tok_name. If token.tok_name doesn't contain COMMENT anymore, it breaks lib2to3, no? At least the debug mode which might not be covered by test_lib2to3.
msg293789 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-16 21:09
Another example from Tools/i18n/pygettext.py, TokenEater::

    def __call__(self, ttype, tstring, stup, etup, line):
        # dispatch
##        import token
##        print >> sys.stderr, 'ttype:', token.tok_name[ttype], \
##              'tstring:', tstring
        self.__state(ttype, tstring, stup[0])


...

            eater.set_filename(filename)
            try:
                tokens = tokenize.tokenize(fp.readline)
                for _token in tokens:
                    eater(*_token)

Another example using token.tok_name with token types coming from tokenize.
msg293793 - (view) Author: Albert-Jan Nijburg (Albert-Jan Nijburg) * Date: 2017-05-16 21:24
> I prefer to add tokenize tokens directly in Lib/token.py, and then get COMMENT, NL and ENCODING using tok_name.index().

That would make more sense from a breaking change perspective, but we would step on the toes of anyone adding `COMMENT`, `NL`, or `ENCODING` to `token.h` because `token.py` is generated from that.

It would also make much more sense to have them as fields on `token` if they are in `tok_name` in `token`.
msg293796 - (view) Author: Albert-Jan Nijburg (Albert-Jan Nijburg) * Date: 2017-05-16 21:29
lib2to3 appears to have it's own token.py as well with NL and COMMENT withtout ENCODING... 

Lib/lib2to3/pgen2/token.py
msg293799 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-16 21:34
> lib2to3 appears to have it's own token.py as well with NL and COMMENT withtout ENCODING...

Oh you are right: Lib/lib2to3/pgen2/driver.py uses Lib/lib2to3/pgen2/token.py.
msg294345 - (view) Author: Albert-Jan Nijburg (Albert-Jan Nijburg) * Date: 2017-05-24 11:16
I've updated the PR and added the tokenize tokens to token.h and their names to tokenizer.c. This way they'll show up when you run token.py. The names will always be in tok_name and tokenizer.py will use those. Not breaking the public api and no longer modifying token.py when you import tokenizer.py.
msg294751 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-30 11:54
https://github.com/python/cpython/pull/1608 doesn't include a Misc/NEWS entry, but I don't want to stress albertjan with Misc/NEWS since I was already annoying enough on the review :-) I'm not sure that the change deserves a NEWS entry. If you consider that yes, it does. It can be done in a separated change.
msg294755 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-30 13:32
Albert-Jan's patch should be merged prior my patch for issue30455, because it can be backported, by my patch shouldn't. But first we should compare them and find differences in results.

My patch adds COMMENT, NL and ENCODING in the documentation, Albert-Jan's patch doesn't. I don't know whether it is worth to add these names to the token module in bugfix releases, and whether it is worth to document them.

My patch wraps all names in _PyParser_TokenNames after <ERRORTOKEN> with angle parenthesis, Albert-Jan's patch inserts new names between <ERRORTOKEN> and <N_TOKENS> without angle parenthesis. I don't know what is more correct.
msg294761 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-30 14:26
I suggest to only modify the master branch. Nobody complained before. It's only a very minor surprising behaviour.
msg294774 - (view) Author: Albert-Jan Nijburg (Albert-Jan Nijburg) * Date: 2017-05-30 17:49
Let me know if you want me to add/change anything about my PR :) I'm happy to do so.
msg294785 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-30 21:12
If merge this only in master, the new members of the token module should be documented (add a versionchanged directive in token.rst). This also needs a Misc/NEWS entry.
msg294823 - (view) Author: Albert-Jan Nijburg (Albert-Jan Nijburg) * Date: 2017-05-31 08:12
I've updated token.rst and Misc/NEWS. Let me know if the wording is correct.
msg294832 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-31 11:04
If move COMMENT, NL and ENCODING to the token module, their documentation should be moved from tokenize.rst to token.rst.

I have asked on Python-Dev whether this is a right way. [1]

[1] https://mail.python.org/pipermail/python-dev/2017-May/148080.html
msg294842 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-31 14:00
New changeset fc354f07855a9197e71f851ad930cbf5652f9160 by Victor Stinner (Albert-Jan Nijburg) in branch 'master':
bpo-25324: copy tok_name before changing it (#1608)
https://github.com/python/cpython/commit/fc354f07855a9197e71f851ad930cbf5652f9160
msg294957 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-06-01 17:08
We got a bug report from Coverity:

*** CID 1411801:  Incorrect expression  (MISSING_COMMA)
/Parser/tokenizer.c: 111 in ()
105         "OP",
106         "AWAIT",
107         "ASYNC",
108         "<ERRORTOKEN>",
109         "COMMENT",
110         "NL",
>>>     CID 1411801:  Incorrect expression  (MISSING_COMMA)
>>>     In the initialization of "_PyParser_TokenNames", a suspicious concatenated string ""ENCODING<N_TOKENS>"" is produced.
111         "ENCODING"
112         "<N_TOKENS>"
113     };
114
115
116     /* Create and initialize a new tok_state structure */

I missed this typo :-p
msg294958 - (view) Author: Albert-Jan Nijburg (Albert-Jan Nijburg) * Date: 2017-06-01 17:20
Aah! Oops I can fix later today.

On Thu, 1 Jun 2017 at 18:08, STINNER Victor <report@bugs.python.org> wrote:

>
> STINNER Victor added the comment:
>
> We got a bug report from Coverity:
>
> *** CID 1411801:  Incorrect expression  (MISSING_COMMA)
> /Parser/tokenizer.c: 111 in ()
> 105         "OP",
> 106         "AWAIT",
> 107         "ASYNC",
> 108         "<ERRORTOKEN>",
> 109         "COMMENT",
> 110         "NL",
> >>>     CID 1411801:  Incorrect expression  (MISSING_COMMA)
> >>>     In the initialization of "_PyParser_TokenNames", a suspicious
> concatenated string ""ENCODING<N_TOKENS>"" is produced.
> 111         "ENCODING"
> 112         "<N_TOKENS>"
> 113     };
> 114
> 115
> 116     /* Create and initialize a new tok_state structure */
>
> I missed this typo :-p
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue25324>
> _______________________________________
>
msg294966 - (view) Author: Mariatta Wijaya (Mariatta) * (Python committer) Date: 2017-06-01 20:51
New changeset c9ccacea3ff441b1ff519c7399602b7db16f9783 by Mariatta (Albert-Jan Nijburg) in branch 'master':
bpo-25324: add missing comma in Parser/tokenizer.c (GH-1910)
https://github.com/python/cpython/commit/c9ccacea3ff441b1ff519c7399602b7db16f9783
msg294970 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-06-01 21:12
Albert-Jan Nijburg added the comment:
> Aah! Oops I can fix later today.

Don't worry. Our test suite and all reviewers also missed it ;-)
msg295267 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-06 15:43
New changeset 5cefb6cfdd089d237ba6724bb5311ee4f04be59f by Serhiy Storchaka in branch 'master':
bpo-25324: Move the description of tokenize tokens to token.rst. (#1911)
https://github.com/python/cpython/commit/5cefb6cfdd089d237ba6724bb5311ee4f04be59f
History
Date User Action Args
2017-06-06 15:54:21serhiy.storchakasetstatus: open -> closed
stage: resolved
resolution: fixed
versions: + Python 3.7, - Python 2.7, Python 3.4, Python 3.5, Python 3.6
2017-06-06 15:43:38serhiy.storchakasetmessages: + msg295267
2017-06-01 22:33:44serhiy.storchakasetpull_requests: + pull_request1991
2017-06-01 21:12:33hayposetmessages: + msg294970
2017-06-01 20:51:29Mariattasetnosy: + Mariatta
messages: + msg294966
2017-06-01 20:02:37Albert-Jan Nijburgsetpull_requests: + pull_request1990
2017-06-01 17:20:21Albert-Jan Nijburgsetmessages: + msg294958
2017-06-01 17:08:45hayposetmessages: + msg294957
2017-05-31 14:00:23hayposetmessages: + msg294842
2017-05-31 11:04:10serhiy.storchakasetmessages: + msg294832
2017-05-31 08:12:14Albert-Jan Nijburgsetmessages: + msg294823
2017-05-30 21:12:20serhiy.storchakasetmessages: + msg294785
2017-05-30 17:49:55Albert-Jan Nijburgsetmessages: + msg294774
2017-05-30 14:26:58hayposetmessages: + msg294761
2017-05-30 13:32:25serhiy.storchakasetmessages: + msg294755
2017-05-30 11:54:45hayposetmessages: + msg294751
2017-05-24 11:16:28Albert-Jan Nijburgsetmessages: + msg294345
2017-05-16 21:34:09hayposetmessages: + msg293799
2017-05-16 21:29:09Albert-Jan Nijburgsetmessages: + msg293796
2017-05-16 21:24:31Albert-Jan Nijburgsetnosy: + Albert-Jan Nijburg
messages: + msg293793
2017-05-16 21:09:26hayposetmessages: + msg293789
2017-05-16 21:06:23hayposetmessages: + msg293787
2017-05-16 21:02:54hayposetmessages: + msg293786
2017-05-16 15:44:07Albert-Jan Nijburgsetpull_requests: + pull_request1699
2015-12-22 08:41:28josephgordonsetnosy: + josephgordon
2015-10-06 23:30:46martin.pantersetnosy: + martin.panter
messages: + msg252439
2015-10-06 16:39:18serhiy.storchakasetmessages: + msg252407
2015-10-06 15:39:00hayposetnosy: + haypo
messages: + msg252398
2015-10-06 15:36:38serhiy.storchakacreate