Add new attribute to TokenInfo to report specific token IDs #46387

Closed
gpolo mannequin opened this issue Feb 17, 2008 · 16 comments
Labels
docs (Documentation in the Doc dir), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

gpolo mannequin commented Feb 17, 2008

BPO 2134
Nosy @akuchling, @terryjreedy, @ncoghlan, @ezio-melotti, @merwok, @meadori, @ericsnowcurrently
Files
  • tokenize_sample.py
  • tokenize_r60884.diff
  • tokenize-exact-type-v0.patch
  • tokenize-exact-type-v1.patch
  • tokenize-docs-2.7-3.2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-01-19.06:49:13.996>
    created_at = <Date 2008-02-17.23:00:29.888>
    labels = ['type-feature', 'library', 'docs']
    title = 'Add new attribute to TokenInfo to report specific token IDs'
    updated_at = <Date 2012-01-19.06:49:13.995>
    user = 'https://bugs.python.org/gpolo'

    bugs.python.org fields:

    activity = <Date 2012-01-19.06:49:13.995>
    actor = 'meador.inge'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2012-01-19.06:49:13.996>
    closer = 'meador.inge'
    components = ['Documentation', 'Library (Lib)']
    creation = <Date 2008-02-17.23:00:29.888>
    creator = 'gpolo'
    dependencies = []
    files = ['9452', '9459', '24045', '24088', '24089']
    hgrepos = []
    issue_num = 2134
    keywords = ['patch']
    message_count = 16.0
    messages = ['62509', '62527', '99894', '112756', '149487', '149489', '149491', '149507', '149510', '149578', '149815', '149820', '150013', '150242', '151607', '151608']
    nosy_count = 10.0
    nosy_names = ['akuchling', 'terry.reedy', 'ncoghlan', 'gpolo', 'ezio.melotti', 'eric.araujo', 'meador.inge', 'docs@python', 'python-dev', 'eric.snow']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue2134'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

gpolo mannequin commented Feb 17, 2008

The function generate_tokens in tokenize.py yields token OP (51) for a colon, while it should yield token COLON (11). This probably affects other Python versions as well.

I'm attaching a small sample that demonstrates this; running it produces the following output:

    1 'if' (1, 0) (1, 2) if a == 2:
    1 'a' (1, 3) (1, 4) if a == 2:
    51 '==' (1, 5) (1, 7) if a == 2:
    2 '2' (1, 8) (1, 9) if a == 2:
    51 ':' (1, 9) (1, 10) if a == 2:
    1 'print' (2, 0) (2, 5) print 'hey'
    3 "'hey'" (2, 6) (2, 11) print 'hey'
    0 '' (3, 0) (3, 0)

I didn't check whether there are problems with other tokens; I noticed this with the colon because I was trying to make some improvements to tabnanny.
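
The attached tokenize_sample.py is not shown in the migrated issue; a minimal sketch that reproduces the report (Python 3 syntax, so the numeric token IDs differ from the 2008 values above) might look like this:

    import io
    import tokenize

    # Minimal reproduction sketch: ':' and '==' both come back with the
    # generic OP token id rather than COLON / EQEQUAL.
    source = "if a == 2:\n    print('hey')\n"
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tok.type, repr(tok.string), tok.start, tok.end, tok.line.rstrip())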

gpolo mannequin added the type-bug and stdlib labels on Feb 17, 2008
gpolo mannequin commented Feb 18, 2008

    I'm attaching a patch that solves this and updates tests.

akuchling (Member) commented:

Unfortunately, I think this will break many users of tokenize.py.

For example, http://browsershots.googlecode.com/svn/trunk/devtools/pep8/pep8.py has code like:

    if (token_type == tokenize.OP and text in '([' and ...):

If tokenize now returns LPAR, this code will no longer work correctly. Tools/i18n/pygettext.py, pylint, WebWare, and pyfuscate all have similar code, so I think we can't change the API this radically. Adding a parameter that enables more precise handling of tokens, defaulting to off, is probably the only way to change this.
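
To make the concern concrete, here is a sketch of the fragile pattern (a hypothetical consumer modeled on the pep8.py check quoted above); any change that yields LPAR instead of OP would make the type check silently stop matching:

    import io
    import tokenize

    # Hypothetical consumer in the style of pep8.py: it assumes '(' and
    # '[' arrive as generic OP tokens, so a switch to LPAR/LSQB would
    # break this check without raising any error.
    def count_open_brackets(source):
        count = 0
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.OP and tok.string in "([":
                count += 1
        return count

    print(count_open_brackets("f(a[0], b)"))  # -> 2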

terryjreedy (Member) commented:

I have not looked at this, but a new parameter would be a new feature. It's a moot point until there is an agreed-upon patch for a current version.

terryjreedy added the type-feature label and removed the type-bug label on Aug 4, 2010
ncoghlan (Contributor) commented:

    There are a *lot* of characters with semantic significance that are reported by the tokenize module as generic "OP" tokens:

    token.LPAR
    token.RPAR
    token.LSQB
    token.RSQB
    token.COLON
    token.COMMA
    token.SEMI
    token.PLUS
    token.MINUS
    token.STAR
    token.SLASH
    token.VBAR
    token.AMPER
    token.LESS
    token.GREATER
    token.EQUAL
    token.DOT
    token.PERCENT
    token.BACKQUOTE
    token.LBRACE
    token.RBRACE
    token.EQEQUAL
    token.NOTEQUAL
    token.LESSEQUAL
    token.GREATEREQUAL
    token.TILDE
    token.CIRCUMFLEX
    token.LEFTSHIFT
    token.RIGHTSHIFT
    token.DOUBLESTAR
    token.PLUSEQUAL
    token.MINEQUAL
    token.STAREQUAL
    token.SLASHEQUAL
    token.PERCENTEQUAL
    token.AMPEREQUAL
    token.VBAREQUAL
    token.CIRCUMFLEXEQUAL
    token.LEFTSHIFTEQUAL
    token.RIGHTSHIFTEQUAL
token.DOUBLESTAREQUAL
    token.DOUBLESLASH
    token.DOUBLESLASHEQUAL
    token.AT

    However, I can't fault tokenize for deciding to treat all of those tokens the same way - for many source code manipulation purposes, these just need to be transcribed literally, and the "OP" token serves that purpose just fine.

    As the extensive test updates in the current patch suggest, AMK is also correct that changing this away from always returning "OP" tokens (even for characters with more specialised tokens available) would be a backwards incompatible change.

    I think there are two parts to this problem, one documentation related (affecting 2.7, 3.2, 3.3) and another that would be an actual change in 3.3:

    1. First, I think 3.3 should add an "exact_type" attribute to TokenInfo instances (without making it part of the tuple-based API). For most tokens, this would be the same as "type", but for OP tokens, it would provide the appropriate more specific token ID.

    2. Second, the tokenize module documentation should state *explicitly* which tokens it collapses down into the generic "OP" token, and explain how to use the "string" attribute to recover the more detailed information.
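
A sketch of how the proposed attribute would behave (this matches the TokenInfo.exact_type attribute that eventually shipped in Python 3.3):

    import io
    import token
    import tokenize

    # .type stays OP for backwards compatibility, while .exact_type
    # resolves the specific token ID; for non-OP tokens the two are equal.
    for tok in tokenize.generate_tokens(io.StringIO("x, y = 1, 2\n").readline):
        if tok.type == tokenize.OP:
            print(tok.string, token.tok_name[tok.exact_type])
    # , COMMA
    # = EQUAL
    # , COMMA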

ncoghlan added the docs label on Dec 15, 2011
ncoghlan changed the title from "function generate_tokens at tokenize.py yields wrong token for colon" to "Add new attribute to TokenInfo to report specific token IDs" on Dec 15, 2011
terryjreedy (Member) commented:

I believe that list includes all symbols and symbol combinations that are syntactically significant in expressions; this is the generalized meaning of 'operator' being used. What do not appear are '#', which marks comments; '_', which is a name character; and '\', which escapes characters within strings. Other symbols within strings will also not be marked as OP tokens. The non-syntactic symbols '$' and '?' are also omitted.
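
A quick sketch of that boundary: '#' starts a COMMENT token, string contents stay inside a single STRING token, and '_' is just part of a NAME, so none of them surface as OP:

    import io
    import tokenize

    # Only '=' is reported as OP here; the '+' inside the string literal
    # and the '#' comment marker never become OP tokens.
    src = "value_1 = 'a+b'  # not ops\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))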

ncoghlan (Contributor) commented:

    Sure, but what does that have to do with anything? tokenize isn't a general purpose tokenizer, it's specifically for tokenizing Python source code.

    The *problem* is that it doesn't currently fully tokenize everything, but doesn't explicitly say that in the module documentation.

    Hence my proposed two-fold fix: document the current behaviour explicitly and also add a separate "exact_type" attribute for easy access to the detailed tokenization without doing your own string comparisons.

terryjreedy (Member) commented:

    If you are responding to me, I am baffled. I gave a concise way to document the current behavior with respect to .OP, which you said you wanted.

ncoghlan (Contributor) commented:

    Ah, I didn't read it as suggested documentation at all - you moved seamlessly from personal commentary to a docs suggestion without separating the two, so it appeared to be a complete non sequitur to me.

    As for the docs suggestion, I think it works as the explanation of which tokens are affected once the concept of the token stream simplification is introduced:
=====
To simplify token stream handling, all literal tokens (':', '{', etc.) are returned using the generic 'OP' token type. This allows them to be handled simply using common code paths (e.g. for literal transcription directly from input to output). Specific tokens can be distinguished by checking the "string" attribute of OP tokens for a match with the expected character sequence.

The affected tokens are all symbols and symbol combinations that are syntactically significant in expressions (as listed in the token module). Anything which is not an independent token (i.e. '#', which marks comments; '_', which is just part of a name; '\', which is used for line continuations; the contents of string literals; and any symbols which are not a defined part of Python's syntax) is completely unaffected by this difference in behaviour.
=====

    If "exact_type" is introduced in 3.3, then the first paragraph can be adjusted accordingly.

terryjreedy (Member) commented:

    Both the proposed text and 3.3 addition look good to me.

meadori (Member) commented Dec 19, 2011

The proposed documentation text seems too complicated and language-expert-speaky to me. We should link to standard definitions where possible to reduce the text here. For example, I believe the "Operators" and "Delimiters" tokens in the "Lexical Analysis" section of the docs (http://docs.python.org/dev/reference/lexical_analysis.html#operators) are exactly what we are trying to describe when referencing "literal tokens" and "affected tokens".

    I like Nick's idea to introduce a new attribute for the exact type, while keeping the tuple structure itself backwards compatible. Attached is a patch for 3.3 that updates the docs, adds exact_type, adds new unit tests, and adds a new CLI option for displaying token names using the exact type.

    An example of the new CLI option is:

    $ echo '1+2**4' | ./python -m tokenize
    1,0-1,1:            NUMBER         '1'            
    1,1-1,2:            OP             '+'            
    1,2-1,3:            NUMBER         '2'            
    1,3-1,5:            OP             '**'           
    1,5-1,6:            NUMBER         '4'            
    1,6-1,7:            NEWLINE        '\n'           
    2,0-2,0:            ENDMARKER      ''             
    $ echo '1+2**4' | ./python -m tokenize -e
    1,0-1,1:            NUMBER         '1'            
    1,1-1,2:            PLUS           '+'            
    1,2-1,3:            NUMBER         '2'            
    1,3-1,5:            DOUBLESTAR     '**'           
    1,5-1,6:            NUMBER         '4'            
    1,6-1,7:            NEWLINE        '\n'           
    2,0-2,0:            ENDMARKER      ''

ncoghlan (Contributor) commented:

    Meador's patch looks good to me. The docs change for 2.7 and 3.2 would be similar, just with text like "Specific tokens can be distinguished by checking the string attribute of OP tokens for a match with the expected character sequence." replacing the reference to the new "exact_type" attribute.

merwok (Member) commented Dec 21, 2011

    The cmdoption directive should be used with a program directive. See library/trace for an example of how to use it and to see the anchors and index entries it generates.

meadori (Member) commented Dec 24, 2011

    The cmdoption directive should be used with a program directive.

    Ah, nice. Thanks for the tip Éric.

Updated patch attached, along with a patch for the 2.7/3.2 doc update.

python-dev mannequin commented Jan 19, 2012

    New changeset 75baef657770 by Meador Inge in branch '2.7':
    Issue bpo-2134: Clarify token.OP handling rationale in tokenize documentation.
    http://hg.python.org/cpython/rev/75baef657770

    New changeset dfd74d752b0e by Meador Inge in branch '3.2':
    Issue bpo-2134: Clarify token.OP handling rationale in tokenize documentation.
    http://hg.python.org/cpython/rev/dfd74d752b0e

    New changeset f4976fa6e830 by Meador Inge in branch 'default':
    Issue bpo-2134: Add support for tokenize.TokenInfo.exact_type.
    http://hg.python.org/cpython/rev/f4976fa6e830

meadori (Member) commented Jan 19, 2012

    Fixed. Thanks for the reviews everyone.

meadori closed this as completed on Jan 19, 2012
ezio-melotti transferred this issue from another repository on Apr 10, 2022