
Importing tokenize modifies token #69511

Closed
serhiy-storchaka opened this issue Oct 6, 2015 · 24 comments
Labels
3.7 (EOL): end of life
stdlib: Python modules in the Lib dir
type-bug: An unexpected behavior, bug, or error

Comments

@serhiy-storchaka
Member

BPO 25324
Nosy @birkenfeld, @vstinner, @meadori, @vadmium, @serhiy-storchaka, @Mariatta, @albertjan
PRs
  • bpo-25324: copy tok_name before changing it #1608
  • bpo-25324: add missing comma #1910
  • bpo-25324: Move the description of tokenize tokens to token.rst. #1911
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2017-06-06.15:54:21.491>
    created_at = <Date 2015-10-06.15:36:38.742>
    labels = ['3.7', 'type-bug', 'library']
    title = 'Importing tokenize modifies token'
    updated_at = <Date 2017-06-06.15:54:21.489>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2017-06-06.15:54:21.489>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-06-06.15:54:21.491>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2015-10-06.15:36:38.742>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 25324
    keywords = []
    message_count = 24.0
    messages = ['252397', '252398', '252407', '252439', '293786', '293787', '293789', '293793', '293796', '293799', '294345', '294751', '294755', '294761', '294774', '294785', '294823', '294832', '294842', '294957', '294958', '294966', '294970', '295267']
    nosy_count = 8.0
    nosy_names = ['georg.brandl', 'vstinner', 'meador.inge', 'martin.panter', 'serhiy.storchaka', 'josephgordon', 'Mariatta', 'Albert-Jan Nijburg']
    pr_nums = ['1608', '1910', '1911']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue25324'
    versions = ['Python 3.7']

    @serhiy-storchaka
    Member Author

    Importing the tokenize module modifies the content of the token module.

    >>> import token
    >>> len(token.tok_name)
    59
    >>> import tokenize
    >>> len(token.tok_name)
    61

    Such a side effect looks like bad design to me. Either token.tok_name should contain the elements needed by tokenize, or tokenize should not modify token.tok_name.
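
    For context, before the fix Lib/tokenize.py extended the shared dict in place, roughly like this (a sketch; the exact lines in tokenize.py may differ slightly):

        import token

        # tokenize grabs the very same dict object owned by the token
        # module and then mutates it, which is why importing tokenize
        # changes token.tok_name:
        tok_name = token.tok_name
        COMMENT = token.N_TOKENS
        tok_name[COMMENT] = 'COMMENT'
        NL = token.N_TOKENS + 1
        tok_name[NL] = 'NL'
        ENCODING = token.N_TOKENS + 2
        tok_name[ENCODING] = 'ENCODING'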

    @serhiy-storchaka serhiy-storchaka added the stdlib and type-bug labels Oct 6, 2015
    @vstinner
    Member

    vstinner commented Oct 6, 2015

    I agree :-) Do you want to work on a patch?

    @serhiy-storchaka
    Member Author

    I leave the decision to the module maintainers.

    @vadmium
    Member

    vadmium commented Oct 6, 2015

    I would fix this by making tokenize.tok_name a copy. It looks like this behaviour dates back to 1997 (see revision 1efc4273fdb7).
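
    A minimal sketch of that copy-based fix in Lib/tokenize.py (the approach PR #1608 later takes; only the .copy() call differs from the pre-fix code):

        import token

        # Take a private copy so the token module's dict stays untouched:
        tok_name = token.tok_name.copy()
        COMMENT = token.N_TOKENS
        tok_name[COMMENT] = 'COMMENT'
        NL = token.N_TOKENS + 1
        tok_name[NL] = 'NL'
        ENCODING = token.N_TOKENS + 2
        tok_name[ENCODING] = 'ENCODING'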

    @vstinner
    Member

    I would fix this by making tokenize.tok_name a copy. It looks like this behaviour dates back to 1997 (see revision 1efc4273fdb7).

    token.tok_name is part of the Python public API:
    https://docs.python.org/dev/library/token.html#token.tok_name

    whereas tokenize.tok_name isn't documented. So I dislike having two disconnected mappings. I prefer to add tokenize tokens directly in Lib/token.py, and then get COMMENT, NL and ENCODING using tok_name.index().
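
    A sketch of that alternative (hypothetical; note that tok_name is a dict mapping codes to names, so the lookup would be by value rather than a list-style .index()):

        import token

        # Assuming COMMENT, NL and ENCODING were defined in Lib/token.py,
        # tokenize could recover the numeric codes by reverse lookup:
        _name_to_code = {name: code for code, name in token.tok_name.items()}
        COMMENT = _name_to_code['COMMENT']
        NL = _name_to_code['NL']
        ENCODING = _name_to_code['ENCODING']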

    @vstinner
    Member

    Extract of Lib/lib2to3/pgen2/driver.py:

        if type in (tokenize.COMMENT, tokenize.NL):
            ...
        if debug:
            self.logger.debug("%s %r (prefix=%r)",
                              token.tok_name[type], value, prefix)

    The code uses tokenize.COMMENT and looks this constant up in token.tok_name. If token.tok_name no longer contains COMMENT, doesn't that break lib2to3? At least the debug mode, which might not be covered by test_lib2to3.
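
    A quick way to see the coupling under the current behaviour (where importing tokenize injects the extra codes into the shared dict):

        import token, tokenize

        # Works today only because importing tokenize added the entry;
        # if tokenize kept a private copy instead, this would raise KeyError:
        print(token.tok_name[tokenize.COMMENT])  # -> 'COMMENT'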

    @vstinner
    Member

    Another example from Tools/i18n/pygettext.py, in TokenEater:

        def __call__(self, ttype, tstring, stup, etup, line):
            # dispatch
    ##        import token
    ##        print >> sys.stderr, 'ttype:', token.tok_name[ttype], \
    ##              'tstring:', tstring
            self.__state(ttype, tstring, stup[0])

    ...

                eater.set_filename(filename)
                try:
                    tokens = tokenize.tokenize(fp.readline)
                    for _token in tokens:
                        eater(*_token)

    Another example using token.tok_name with token types coming from tokenize.

    @albertjan
    Mannequin

    albertjan mannequin commented May 16, 2017

    I prefer to add tokenize tokens directly in Lib/token.py, and then get COMMENT, NL and ENCODING using tok_name.index().

    That would make more sense from a breaking-change perspective, but we would step on the toes of anyone adding COMMENT, NL, or ENCODING to token.h, because token.py is generated from it.

    It would also make much more sense to have them as fields on the token module if their names are in token.tok_name.

    @albertjan
    Mannequin

    albertjan mannequin commented May 16, 2017

    lib2to3 appears to have its own token.py as well, with NL and COMMENT but without ENCODING...

    Lib/lib2to3/pgen2/token.py

    @vstinner
    Member

    lib2to3 appears to have its own token.py as well, with NL and COMMENT but without ENCODING...

    Oh you are right: Lib/lib2to3/pgen2/driver.py uses Lib/lib2to3/pgen2/token.py.

    @albertjan
    Mannequin

    albertjan mannequin commented May 24, 2017

    I've updated the PR and added the tokenize tokens to token.h and their names to tokenizer.c. This way they'll show up when you run token.py. The names will always be in tok_name, and tokenize.py will use those. This doesn't break the public API, and importing tokenize no longer modifies the token module.
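
    A sketch of what those additions look like (numeric values illustrative, following the layout where the new codes sit between ERRORTOKEN and N_TOKENS):

        /* Include/token.h */
        #define ERRORTOKEN      56
        #define COMMENT         57
        #define NL              58
        #define ENCODING        59
        #define N_TOKENS        60

        /* Parser/tokenizer.c: matching tail of _PyParser_TokenNames */
        "<ERRORTOKEN>",
        "COMMENT",
        "NL",
        "ENCODING",
        "<N_TOKENS>"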

    @vstinner
    Member

    #1608 doesn't include a Misc/NEWS entry, but I don't want to stress albertjan with Misc/NEWS since I was already annoying enough on the review :-) I'm not sure that the change deserves a NEWS entry. If you consider that it does, it can be added in a separate change.

    @serhiy-storchaka
    Member Author

    Albert-Jan's patch should be merged prior to my patch for bpo-30455, because it can be backported, but my patch shouldn't be. But first we should compare them and find the differences in their results.

    My patch adds COMMENT, NL and ENCODING to the documentation; Albert-Jan's patch doesn't. I don't know whether it is worth adding these names to the token module in bugfix releases, or whether it is worth documenting them.

    My patch wraps all names in _PyParser_TokenNames after <ERRORTOKEN> in angle brackets; Albert-Jan's patch inserts the new names between <ERRORTOKEN> and <N_TOKENS> without angle brackets. I don't know which is more correct.
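
    Concretely, the two variants of the tail of _PyParser_TokenNames would look like this (a sketch based on the description above):

        /* Serhiy's patch: new names bracketed like the other pseudo-tokens */
        "<ERRORTOKEN>", "<COMMENT>", "<NL>", "<ENCODING>", "<N_TOKENS>"

        /* Albert-Jan's patch: new names bare, inserted before <N_TOKENS> */
        "<ERRORTOKEN>", "COMMENT", "NL", "ENCODING", "<N_TOKENS>"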

    @vstinner
    Member

    I suggest only modifying the master branch. Nobody complained before; it's only a very minor surprising behaviour.

    @albertjan
    Mannequin

    albertjan mannequin commented May 30, 2017

    Let me know if you want me to add/change anything about my PR :) I'm happy to do so.

    @serhiy-storchaka
    Member Author

    If we merge this only in master, the new members of the token module should be documented (add a versionchanged directive in token.rst). This also needs a Misc/NEWS entry.

    @albertjan
    Mannequin

    albertjan mannequin commented May 31, 2017

    I've updated token.rst and Misc/NEWS. Let me know if the wording is correct.

    @serhiy-storchaka
    Member Author

    If we move COMMENT, NL and ENCODING to the token module, their documentation should be moved from tokenize.rst to token.rst.

    I have asked on Python-Dev whether this is the right way. [1]

    [1] https://mail.python.org/pipermail/python-dev/2017-May/148080.html

    @vstinner
    Member

    New changeset fc354f0 by Victor Stinner (Albert-Jan Nijburg) in branch 'master':
    bpo-25324: copy tok_name before changing it (#1608)
    fc354f0

    @vstinner
    Member

    vstinner commented Jun 1, 2017

    We got a bug report from Coverity:

    *** CID 1411801:  Incorrect expression  (MISSING_COMMA)
    /Parser/tokenizer.c: 111 in ()
    105         "OP",
    106         "AWAIT",
    107         "ASYNC",
    108         "<ERRORTOKEN>",
    109         "COMMENT",
    110         "NL",
    >>>     CID 1411801:  Incorrect expression  (MISSING_COMMA)
    >>>     In the initialization of "_PyParser_TokenNames", a suspicious concatenated string ""ENCODING<N_TOKENS>"" is produced.
    111         "ENCODING"
    112         "<N_TOKENS>"
    113     };
    114
    115
    116     /* Create and initialize a new tok_state structure */

    I missed this typo :-p
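
    The fix (merged below as GH-1910) is simply to add the comma, so the two adjacent string literals are no longer concatenated by the C compiler:

        "COMMENT",
        "NL",
        "ENCODING",     /* the missing comma: without it, literal
                           concatenation produced "ENCODING<N_TOKENS>" */
        "<N_TOKENS>"
        };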

    @albertjan
    Mannequin

    albertjan mannequin commented Jun 1, 2017

    Aah! Oops, I can fix it later today.

    @Mariatta
    Member

    Mariatta commented Jun 1, 2017

    New changeset c9ccace by Mariatta (Albert-Jan Nijburg) in branch 'master':
    bpo-25324: add missing comma in Parser/tokenizer.c (GH-1910)
    c9ccace

    @vstinner
    Member

    vstinner commented Jun 1, 2017

    Albert-Jan Nijburg added the comment:

    Aah! Oops I can fix later today.

    Don't worry. Our test suite and all reviewers also missed it ;-)

    @serhiy-storchaka
    Member Author

    New changeset 5cefb6c by Serhiy Storchaka in branch 'master':
    bpo-25324: Move the description of tokenize tokens to token.rst. (#1911)
    5cefb6c

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) label Jun 6, 2017
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022