Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenize module does not mirror "end-of-input" is newline behavior #78080

Closed
ammaraskar opened this issue Jun 19, 2018 · 27 comments
Closed

Tokenize module does not mirror "end-of-input" is newline behavior #78080

ammaraskar opened this issue Jun 19, 2018 · 27 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@ammaraskar
Copy link
Member

BPO 33899
Nosy @terryjreedy, @gpshead, @taleinat, @benjaminp, @ned-deily, @meadori, @asottile, @ammaraskar, @miss-islington, @asmeurer
PRs
  • bpo-33899: Make tokenize module mirror end-of-file is end-of-line behavior #7891
  • [3.7] bpo-33899: Make tokenize module mirror end-of-file is end-of-li… #8132
  • [2.7] bpo-33899: Make tokenize module mirror end-of-file is end-of-li… #8133
  • [3.6] bpo-33899: Make tokenize module mirror end-of-file is end-of-li… #8134
  • [2.7] bpo-33899: Revert tokenize module adding an implicit final NEWLINE #10072
  • bpo-33899: add mention of this change in What's New #10073
  • [3.7] bpo-33899: Mention tokenize behavior change in What's New (GH-10073) #10074
  • [3.6] bpo-33899: Mention tokenize behavior change in What's New (GH-10073) #10075
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ammaraskar'
    closed_at = <Date 2018-07-06.10:24:50.699>
    created_at = <Date 2018-06-19.07:41:52.217>
    labels = ['3.7', '3.8', 'type-bug', 'library']
    title = 'Tokenize module does not mirror "end-of-input" is newline behavior'
    updated_at = <Date 2019-03-22.11:46:47.154>
    user = 'https://github.com/ammaraskar'

    bugs.python.org fields:

    activity = <Date 2019-03-22.11:46:47.154>
    actor = 'brechtm'
    assignee = 'ammar2'
    closed = True
    closed_date = <Date 2018-07-06.10:24:50.699>
    closer = 'taleinat'
    components = ['Library (Lib)']
    creation = <Date 2018-06-19.07:41:52.217>
    creator = 'ammar2'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33899
    keywords = ['patch']
    message_count = 27.0
    messages = ['319934', '321154', '321162', '321163', '321164', '321165', '328220', '328222', '328226', '328227', '328238', '328283', '328318', '328324', '328353', '328354', '328355', '328356', '328357', '328358', '328359', '328360', '328369', '328383', '328877', '330213', '338601']
    nosy_count = 11.0
    nosy_names = ['terry.reedy', 'gregory.p.smith', 'taleinat', 'benjamin.peterson', 'ned.deily', 'meador.inge', 'brechtm', 'Anthony Sottile', 'ammar2', 'miss-islington', 'asmeurer']
    pr_nums = ['7891', '8132', '8133', '8134', '10072', '10073', '10074', '10075']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue33899'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @ammaraskar
    Copy link
    Member Author

    As was pointed out in https://bugs.python.org/issue33766 there is an edge case in the tokenizer whereby it will implicitly treat the end of input as a newline. The tokenize module in stdlib does not mirror the C code's behavior in this case.

    tokenizer.c:

    ~/cpython $ echo -n 'x' | ./python
    ----------
    NAME ("x")
    NEWLINE
    ENDMARKER

    tokenize module:

    ~/cpython $ echo -n 'x' | ./python -m tokenize
    1,0-1,1: NAME 'x'
    2,0-2,0: ENDMARKER ''

    The instrumentation to have the C tokenizer dump out its tokens is mine, can provide a diff to produce that output if needed.

    @ammaraskar ammaraskar added the 3.8 only security fixes label Jun 19, 2018
    @ammaraskar ammaraskar self-assigned this Jun 19, 2018
    @ammaraskar ammaraskar added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jun 19, 2018
    @taleinat
    Copy link
    Contributor

    taleinat commented Jul 6, 2018

    New changeset c4ef489 by Tal Einat (Ammar Askar) in branch 'master':
    bpo-33899: Make tokenize module mirror end-of-file is end-of-line behavior (GH-7891)
    c4ef489

    @taleinat
    Copy link
    Contributor

    taleinat commented Jul 6, 2018

    New changeset ab75d9e by Tal Einat (Ammar Askar) in branch '3.7':
    [3.7] bpo-33899: Make tokenize module mirror end-of-file is end-of-line behavior (GH-7891) (GH-8132)
    ab75d9e

    @taleinat
    Copy link
    Contributor

    taleinat commented Jul 6, 2018

    New changeset 11c36a3 by Tal Einat (Ammar Askar) in branch '3.6':
    [3.6] bpo-33899: Make tokenize module mirror end-of-file is end-of-line behavior (GH-7891) (GH-8134)
    11c36a3

    @taleinat
    Copy link
    Contributor

    taleinat commented Jul 6, 2018

    New changeset 7829bba by Tal Einat (Ammar Askar) in branch '2.7':
    [2.7] bpo-33899: Make tokenize module mirror end-of-file is end-of-line behavior (GH-7891) (bpo-8133)
    7829bba

    @taleinat
    Copy link
    Contributor

    taleinat commented Jul 6, 2018

    Thanks for all of your work on this, Ammar!

    @taleinat taleinat added the 3.7 (EOL) end of life label Jul 6, 2018
    @taleinat taleinat closed this as completed Jul 6, 2018
    @asottile
    Copy link
    Mannequin

    asottile mannequin commented Oct 21, 2018

    This change in behaviour is breaking pycodestyle: PyCQA/pycodestyle#786

    Perhaps it shouldn't have been backported (especially all the way to python2.7?)

    @taleinat
    Copy link
    Contributor

    This was backported since it was considered a bug, but you are right that it broke backwards compatibility, and perhaps shouldn't have been backported.

    Still, with 3.6.6 and 3.7.1 now released, that ship has sailed.

    We could perhaps revert this on the 2.7 branch, but I feel that reverting this change only on 2.7 would just cause even more confusion.

    @asottile
    Copy link
    Mannequin

    asottile mannequin commented Oct 21, 2018

    I'm surprised this was classified as a bug! Though that's subjective so I get that it's difficult to decide what is and what isn't ¯\(ツ)

    @ned-deily
    Copy link
    Member

    Apparently this change also affected IPython. Perhaps we should add an entry to the whatsnew documents for 3.7.1 and 3.7.6:

    https://docs.python.org/3/whatsnew/3.7.html#notable-changes-in-python-3-7-1

    https://docs.python.org/3.6/whatsnew/3.6.html#notable-changes-in-python-3-6-7

    @taleinat
    Copy link
    Contributor

    I'm sorry to have caused this mess, it was bad judgement on my part.

    Adding mention in What's is a good idea, Ned, I'll do that.

    @taleinat
    Copy link
    Contributor

    Ned, should this also be added to the 2.7 What's New? Or perhaps reverted on the 2.7 branch?

    @ned-deily
    Copy link
    Member

    I don't have a strong opinion about 2.7 here. Ultimately, it's Benjamin's call. But it might make sense to revert for 2.7 since it hasn't been released yet.

    @benjaminp
    Copy link
    Contributor

    Please revert in 2.7.

    @taleinat
    Copy link
    Contributor

    See PR #54281 for reverting in 2.7.

    @gpshead
    Copy link
    Member

    gpshead commented Oct 24, 2018

    FYI, An example of other fallout from this change - patsy broke and needed this fix:

    pydata/patsy@4f53bba#diff-53c70e68c6dfd4fe9b08427792cb2bd6

    @taleinat
    Copy link
    Contributor

    See PR #54282 adding mention in "What's New".

    @gpshead
    Copy link
    Member

    gpshead commented Oct 24, 2018

    some pylint fallout appears to be addressed in pylint-dev/pylint@2698cbe

    @gpshead
    Copy link
    Member

    gpshead commented Oct 24, 2018

    New changeset dfba1f6 by Gregory P. Smith (Tal Einat) in branch 'master':
    bpo-33899: Mention tokenize behavior change in What's New (GH-10073)
    dfba1f6

    @miss-islington
    Copy link
    Contributor

    New changeset 9a04762 by Miss Islington (bot) (Tal Einat) in branch '3.6':
    [3.6] bpo-33899: Mention tokenize behavior change in What's New (GH-10073) (GH-10075)
    9a04762

    @miss-islington
    Copy link
    Contributor

    New changeset b4c9874 by Miss Islington (bot) (Tal Einat) in branch '3.7':
    [3.7] bpo-33899: Mention tokenize behavior change in What's New (GH-10073) (GH-10074)
    b4c9874

    @taleinat
    Copy link
    Contributor

    Thanks for helping with the fallout from this, Gregory.

    @terryjreedy
    Copy link
    Member

    bpo-33766 was about documenting the C tokenizer change, some years ago, that made end-of-file EOF and end-of-string EOS generate the NEWLINE token required to properly terminate statements. "The end of input also serves
    as an implicit terminator for the final physical line."

    Although the tokenizer module intentionally does not exactly mirror the C tokenizer (it adds COMMENT tokens), it plausibly seems like a bug that it was not changed along with the C tokenizer, as it has since been tokenizing valid code as grammatically invalid. But I agree that this fix is too disruptive for 2.7.

    @benjaminp
    Copy link
    Contributor

    New changeset a1f45ec by Benjamin Peterson (Tal Einat) in branch '2.7':
    bpo-33899: Revert tokenize module adding an implicit final NEWLINE (GH-10072)
    a1f45ec

    @gpshead
    Copy link
    Member

    gpshead commented Oct 29, 2018

    https://bugs.python.org/issue35107 filed to track further fallout from this API change.

    @asmeurer
    Copy link
    Mannequin

    asmeurer mannequin commented Nov 21, 2018

    Is it expected behavior that comments produce NEWLINE if they don't have a newline and don't produce NEWLINE if they do (that is, '# comment' produces NEWLINE but '# comment\n' does not)?

    @brechtm
    Copy link
    Mannequin

    brechtm mannequin commented Mar 22, 2019

    In order to adapt code to this change, can we assume that a NEWLINE token with an empty string only occurs right before the ENDMARKER?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants