
Importing tokenize modifies token #69511

Closed
serhiy-storchaka opened this issue Oct 6, 2015 · 24 comments
Labels
3.7 (EOL): end of life
stdlib: Python modules in the Lib dir
type-bug: An unexpected behavior, bug, or error

Comments

@serhiy-storchaka
Member

BPO 25324
Nosy @birkenfeld, @vstinner, @meadori, @vadmium, @serhiy-storchaka, @Mariatta, @albertjan
PRs
  • bpo-25324: copy tok_name before changing it #1608
  • bpo-25324: add missing comma #1910
  • bpo-25324: Move the description of tokenize tokens to token.rst. #1911
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2017-06-06.15:54:21.491>
    created_at = <Date 2015-10-06.15:36:38.742>
    labels = ['3.7', 'type-bug', 'library']
    title = 'Importing tokenize modifies token'
    updated_at = <Date 2017-06-06.15:54:21.489>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2017-06-06.15:54:21.489>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-06-06.15:54:21.491>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2015-10-06.15:36:38.742>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 25324
    keywords = []
    message_count = 24.0
    messages = ['252397', '252398', '252407', '252439', '293786', '293787', '293789', '293793', '293796', '293799', '294345', '294751', '294755', '294761', '294774', '294785', '294823', '294832', '294842', '294957', '294958', '294966', '294970', '295267']
    nosy_count = 8.0
    nosy_names = ['georg.brandl', 'vstinner', 'meador.inge', 'martin.panter', 'serhiy.storchaka', 'josephgordon', 'Mariatta', 'Albert-Jan Nijburg']
    pr_nums = ['1608', '1910', '1911']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue25324'
    versions = ['Python 3.7']

    @serhiy-storchaka
    Member Author

    Importing the tokenize module modifies the content of the token module.

    >>> import token
    >>> len(token.tok_name)
    59
    >>> import tokenize
    >>> len(token.tok_name)
    61

    Such a side effect looks like bad design to me. Either token.tok_name should contain the elements needed by tokenize, or tokenize should not modify token.tok_name.
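
    For context, before the fix Lib/tokenize.py extended the shared dict in place, roughly like this (a sketch; the exact lines in tokenize.py may differ slightly):

        import token

        # tokenize grabs the very same dict object owned by the token
        # module and then mutates it, which is why importing tokenize
        # changes token.tok_name:
        tok_name = token.tok_name
        COMMENT = token.N_TOKENS
        tok_name[COMMENT] = 'COMMENT'
        NL = token.N_TOKENS + 1
        tok_name[NL] = 'NL'
        ENCODING = token.N_TOKENS + 2
        tok_name[ENCODING] = 'ENCODING'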

    @serhiy-storchaka serhiy-storchaka added the stdlib and type-bug labels Oct 6, 2015
    @vstinner
    Member

    vstinner commented Oct 6, 2015

    I agree :-) Do you want to work on a patch?

    @serhiy-storchaka
    Member Author

    I leave the decision to the module maintainers.

    @vadmium
    Member

    vadmium commented Oct 6, 2015

    I would fix this by making tokenize.tok_name a copy. It looks like this behaviour dates back to 1997 (see revision 1efc4273fdb7).
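
    A minimal sketch of that copy-based fix in Lib/tokenize.py (the approach PR #1608 later takes; only the .copy() call differs from the pre-fix code):

        import token

        # Take a private copy so the token module's dict stays untouched:
        tok_name = token.tok_name.copy()
        COMMENT = token.N_TOKENS
        tok_name[COMMENT] = 'COMMENT'
        NL = token.N_TOKENS + 1
        tok_name[NL] = 'NL'
        ENCODING = token.N_TOKENS + 2
        tok_name[ENCODING] = 'ENCODING'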

    @vstinner
    Member

    I would fix this by making tokenize.tok_name a copy. It looks like this behaviour dates back to 1997 (see revision 1efc4273fdb7).

    token.tok_name is part of the Python public API:
    https://docs.python.org/dev/library/token.html#token.tok_name

    whereas tokenize.tok_name isn't documented. So I dislike having two disconnected mappings. I prefer to add tokenize tokens directly in Lib/token.py, and then get COMMENT, NL and ENCODING using tok_name.index().
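
    A sketch of that alternative (hypothetical; note that tok_name is a dict mapping codes to names, so the lookup would be by value rather than a list-style .index()):

        import token

        # Assuming COMMENT, NL and ENCODING were defined in Lib/token.py,
        # tokenize could recover the numeric codes by reverse lookup:
        _name_to_code = {name: code for code, name in token.tok_name.items()}
        COMMENT = _name_to_code['COMMENT']
        NL = _name_to_code['NL']
        ENCODING = _name_to_code['ENCODING']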

    @vstinner
    Member

    Extract of Lib/lib2to3/pgen2/driver.py:

        if type in (tokenize.COMMENT, tokenize.NL):
            ...
        if debug:
            self.logger.debug("%s %r (prefix=%r)",
                              token.tok_name[type], value, prefix)

    The code uses tokenize.COMMENT and looks this constant up in token.tok_name. If token.tok_name no longer contains COMMENT, doesn't that break lib2to3? At least the debug mode, which might not be covered by test_lib2to3.
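
    A quick way to see the coupling under the current behaviour (where importing tokenize injects the extra codes into the shared dict):

        import token, tokenize

        # Works today only because importing tokenize added the entry;
        # if tokenize kept a private copy instead, this would raise KeyError:
        print(token.tok_name[tokenize.COMMENT])  # -> 'COMMENT'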

    @vstinner
    Member

    Another example from Tools/i18n/pygettext.py, in TokenEater:

        def __call__(self, ttype, tstring, stup, etup, line):
            # dispatch
    ##        import token
    ##        print >> sys.stderr, 'ttype:', token.tok_name[ttype], \
    ##              'tstring:', tstring
            self.__state(ttype, tstring, stup[0])

    ...

                eater.set_filename(filename)
                try:
                    tokens = tokenize.tokenize(fp.readline)
                    for _token in tokens:
                        eater(*_token)

    Another example using token.tok_name with token types coming from tokenize.

    @albertjan
    Mannequin

    albertjan mannequin commented May 16, 2017

    I prefer to add tokenize tokens directly in Lib/token.py, and then get COMMENT, NL and ENCODING using tok_name.index().

    That would make more sense from a breaking-change perspective, but we would step on the toes of anyone adding COMMENT, NL, or ENCODING to token.h, because token.py is generated from it.

    It would also make much more sense to have them as fields on the token module if their names are in token.tok_name.

    @albertjan
    Mannequin

    albertjan mannequin commented May 16, 2017

    lib2to3 appears to have its own token.py as well, with NL and COMMENT but without ENCODING...

    Lib/lib2to3/pgen2/token.py

    @vstinner
    Member

    lib2to3 appears to have its own token.py as well, with NL and COMMENT but without ENCODING...

    Oh you are right: Lib/lib2to3/pgen2/driver.py uses Lib/lib2to3/pgen2/token.py.

    @albertjan
    Mannequin

    albertjan mannequin commented May 24, 2017

    I've updated the PR and added the tokenize tokens to token.h and their names to tokenizer.c. This way they'll show up when you run token.py. The names will always be in tok_name, and tokenize.py will use those. This doesn't break the public API, and importing tokenize no longer modifies the token module.
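
    A sketch of what those additions look like (numeric values illustrative, following the layout where the new codes sit between ERRORTOKEN and N_TOKENS):

        /* Include/token.h */
        #define ERRORTOKEN      56
        #define COMMENT         57
        #define NL              58
        #define ENCODING        59
        #define N_TOKENS        60

        /* Parser/tokenizer.c: matching tail of _PyParser_TokenNames */
        "<ERRORTOKEN>",
        "COMMENT",
        "NL",
        "ENCODING",
        "<N_TOKENS>"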

    @vstinner
    Member

    #1608 doesn't include a Misc/NEWS entry, but I don't want to stress albertjan with Misc/NEWS since I was already annoying enough on the review :-) I'm not sure that the change deserves a NEWS entry. If you consider that it does, it can be added in a separate change.

    @serhiy-storchaka
    Member Author

    Albert-Jan's patch should be merged prior to my patch for bpo-30455, because it can be backported, but my patch shouldn't be. But first we should compare them and find the differences in their results.

    My patch adds COMMENT, NL and ENCODING to the documentation; Albert-Jan's patch doesn't. I don't know whether it is worth adding these names to the token module in bugfix releases, or whether it is worth documenting them.

    My patch wraps all names in _PyParser_TokenNames after <ERRORTOKEN> in angle brackets; Albert-Jan's patch inserts the new names between <ERRORTOKEN> and <N_TOKENS> without angle brackets. I don't know which is more correct.
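
    Concretely, the two variants of the tail of _PyParser_TokenNames would look like this (a sketch based on the description above):

        /* Serhiy's patch: new names bracketed like the other pseudo-tokens */
        "<ERRORTOKEN>", "<COMMENT>", "<NL>", "<ENCODING>", "<N_TOKENS>"

        /* Albert-Jan's patch: new names bare, inserted before <N_TOKENS> */
        "<ERRORTOKEN>", "COMMENT", "NL", "ENCODING", "<N_TOKENS>"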

    @vstinner
    Member

    I suggest only modifying the master branch. Nobody complained before; it's only a very minor surprising behaviour.

    @albertjan
    Mannequin

    albertjan mannequin commented May 30, 2017

    Let me know if you want me to add/change anything about my PR :) I'm happy to do so.

    @serhiy-storchaka
    Member Author

    If we merge this only in master, the new members of the token module should be documented (add a versionchanged directive in token.rst). This also needs a Misc/NEWS entry.

    @albertjan
    Mannequin

    albertjan mannequin commented May 31, 2017

    I've updated token.rst and Misc/NEWS. Let me know if the wording is correct.

    @serhiy-storchaka
    Member Author

    If we move COMMENT, NL and ENCODING to the token module, their documentation should be moved from tokenize.rst to token.rst.

    I have asked on Python-Dev whether this is the right way. [1]

    [1] https://mail.python.org/pipermail/python-dev/2017-May/148080.html

    @vstinner
    Member

    New changeset fc354f0 by Victor Stinner (Albert-Jan Nijburg) in branch 'master':
    bpo-25324: copy tok_name before changing it (#1608)
    fc354f0

    @vstinner
    Member

    vstinner commented Jun 1, 2017

    We got a bug report from Coverity:

    *** CID 1411801:  Incorrect expression  (MISSING_COMMA)
    /Parser/tokenizer.c: 111 in ()
    105         "OP",
    106         "AWAIT",
    107         "ASYNC",
    108         "<ERRORTOKEN>",
    109         "COMMENT",
    110         "NL",
    >>>     CID 1411801:  Incorrect expression  (MISSING_COMMA)
    >>>     In the initialization of "_PyParser_TokenNames", a suspicious concatenated string ""ENCODING<N_TOKENS>"" is produced.
    111         "ENCODING"
    112         "<N_TOKENS>"
    113     };
    114
    115
    116     /* Create and initialize a new tok_state structure */

    I missed this typo :-p
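
    The fix (merged below as GH-1910) is simply to add the comma, so the two adjacent string literals are no longer concatenated by the C compiler:

        "COMMENT",
        "NL",
        "ENCODING",     /* the missing comma: without it, literal
                           concatenation produced "ENCODING<N_TOKENS>" */
        "<N_TOKENS>"
        };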

    @albertjan
    Mannequin

    albertjan mannequin commented Jun 1, 2017

    Aah! Oops, I can fix it later today.

    @Mariatta
    Member

    Mariatta commented Jun 1, 2017

    New changeset c9ccace by Mariatta (Albert-Jan Nijburg) in branch 'master':
    bpo-25324: add missing comma in Parser/tokenizer.c (GH-1910)
    c9ccace

    @vstinner
    Member

    vstinner commented Jun 1, 2017

    Albert-Jan Nijburg added the comment:

    Aah! Oops I can fix later today.

    Don't worry. Our test suite and all reviewers also missed it ;-)

    @serhiy-storchaka
    Member Author

    New changeset 5cefb6c by Serhiy Storchaka in branch 'master':
    bpo-25324: Move the description of tokenize tokens to token.rst. (#1911)
    5cefb6c

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) label Jun 6, 2017
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022