
untokenize() fails on tokenize output when a newline is missing #79288

Closed
gpshead opened this issue Oct 29, 2018 · 9 comments

Labels: 3.7 (EOL) end of life, 3.8 only security fixes

Comments

@gpshead
Member

gpshead commented Oct 29, 2018

BPO 35107
Nosy @terryjreedy, @gpshead, @taleinat, @meadori, @serhiy-storchaka, @ammaraskar, @pablogsal, @iritkatriel
Superseder
  • bpo-44667: tokenize.py emits spurious NEWLINE if file ends on a comment without a newline
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2022-01-19.20:50:24.347>
    created_at = <Date 2018-10-29.22:26:38.314>
    labels = ['3.7', '3.8']
    title = 'untokenize() fails on tokenize output when a newline is missing'
    updated_at = <Date 2022-01-19.20:52:55.362>
    user = 'https://github.com/gpshead'

    bugs.python.org fields:

    activity = <Date 2022-01-19.20:52:55.362>
    actor = 'terry.reedy'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-01-19.20:50:24.347>
    closer = 'terry.reedy'
    components = []
    creation = <Date 2018-10-29.22:26:38.314>
    creator = 'gregory.p.smith'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 35107
    keywords = []
    message_count = 9.0
    messages = ['328876', '328878', '328879', '328880', '328882', '328884', '328927', '410915', '410982']
    nosy_count = 8.0
    nosy_names = ['terry.reedy', 'gregory.p.smith', 'taleinat', 'meador.inge', 'serhiy.storchaka', 'ammar2', 'pablogsal', 'iritkatriel']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '44667'
    type = None
    url = 'https://bugs.python.org/issue35107'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @gpshead
    Member Author

    gpshead commented Oct 29, 2018

    The behavior change introduced in 3.6.7 and 3.7.1 via https://bugs.python.org/issue33899 has further consequences:

    >>> tokenize.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../cpython/cpython-upstream/Lib/tokenize.py", line 332, in untokenize
        out = ut.untokenize(iterable)
      File ".../cpython/cpython-upstream/Lib/tokenize.py", line 266, in untokenize
        self.add_whitespace(start)
      File ".../cpython/cpython-upstream/Lib/tokenize.py", line 227, in add_whitespace
        raise ValueError("start ({},{}) precedes previous end ({},{})"
    ValueError: start (1,1) precedes previous end (2,0)

    The same goes for using the documented tokenize API (generate_tokens is not documented):

    tokenize.untokenize(tokenize.tokenize(io.BytesIO(b'#').readline))
    ...
    ValueError: start (1,1) precedes previous end (2,0)
    
    

    untokenize() is no longer able to work on output of generate_tokens() if the input to generate_tokens() did not end in a newline.

    Today's workaround: if the last line returned by the readline callable passed to tokenize or generate_tokens is missing a trailing newline, append one. Very annoying to implement.
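
    A minimal sketch of that workaround (the wrapper below is a hypothetical helper written for this note, not part of the tokenize API):

    import io
    import tokenize

    def readline_with_final_newline(readline):
        # Hypothetical wrapper: make sure the last line fed to the tokenizer
        # ends with '\n', so untokenize() can round-trip the token stream
        # without raising ValueError on affected versions.
        def wrapped():
            line = readline()
            if line and not line.endswith('\n'):
                return line + '\n'
            return line
        return wrapped

    tokens = tokenize.generate_tokens(
        readline_with_final_newline(io.StringIO('#').readline))
    print(repr(tokenize.untokenize(tokens)))  # prints '#\n'; the appended newline is now part of the output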

    gpshead added the 3.7 (EOL) end of life and 3.8 only security fixes labels on Oct 29, 2018
    @ammaraskar
    Member

    Looks like this is caused by these lines:

    cpython/Lib/tokenize.py

    Lines 551 to 558 in b83d917

    if line[pos] == '#':
        comment_token = line[pos:].rstrip('\r\n')
        yield TokenInfo(COMMENT, comment_token,
                        (lnum, pos), (lnum, pos + len(comment_token)), line)
        pos += len(comment_token)

    yield TokenInfo(NL, line[pos:],
                    (lnum, pos), (lnum, len(line)), line)

    which adds a newline token implicitly after comments. Since the input didn't terminate with a '\n', the code to add a newline at the end of input kicks in.

    @ammaraskar
    Member

    FWIW, I think there's more at play here than the newline change. This is the behavior I get on 3.6.5 (before the newline change was applied): '#' works as expected, but check out this input:

    >>> t.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
    '#'
    >>> t.untokenize(tokenize.generate_tokens(io.StringIO('x=1').readline))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "D:\Python365\lib\tokenize.py", line 272, in untokenize
        self.add_whitespace(start)
      File "D:\Python365\lib\tokenize.py", line 234, in add_whitespace
        .format(row, col, self.prev_row, self.prev_col))
    ValueError: start (1,0) precedes previous end (2,0)

    @gpshead
    Member Author

    gpshead commented Oct 30, 2018

    Interesting! I have a 3.6.2 sitting around and cannot reproduce that "x=1" behavior.

    I don't know what the behavior _should_ be. It just feels natural that untokenize should be able to round trip anything tokenize or generate_tokens emits without raising an exception.

    I'm filing this because the "#" case came up in some existing code we had that happened to effectively test that particular round trip.

    @ammaraskar
    Member

    Actually, never mind, disregard that; I was just testing it wrong. I think the simplest fix here is to add '#' to the list of characters below so we don't double-insert newlines for comments:

    if last_line and last_line[-1] not in '\r\n':

    And a test that round-trips a file ending with a comment but no newline would exercise that particular branch.

    I'll make a PR this week if no one else gets to it.
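
    Read as a concrete change, the suggestion is to guard the synthetic end-of-input NEWLINE so it is not emitted after a trailing comment. A rough sketch of that guard (one way to write it; an illustration only, not the patch that eventually landed via bpo-44667):

    # Sketch of the end-of-input handling in Lib/tokenize.py: skip the
    # synthetic NEWLINE when the final, newline-less line is a comment,
    # since the comment branch already emitted an NL token for it.
    if (last_line and last_line[-1] not in '\r\n'
            and not last_line.strip().startswith('#')):
        yield TokenInfo(NEWLINE, '', (lnum - 1, len(last_line)),
                        (lnum - 1, len(last_line) + 1), '')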

    @serhiy-storchaka
    Member

    I am surprised that removing the newline character adds a token:

    >>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#\n').readline)))
    [TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#\n'),
     TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n'),
     TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
    >>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#').readline)))
    [TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#'),
     TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
     TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line=''),
     TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]

    @terryjreedy
    Member

    It seems to me a bug that if '\n' is not present, tokenize adds both NL and NEWLINE tokens, instead of just one of them. Moreover, both tuples of the double correction look wrong.

    If '\n' is present,
    TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n')
    looks correct.

    If NL represents a real character, the length-0 string='' in the generated
    TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
    seems wrong. I suspect that the idea was to mis-represent NL to avoid '\n' being added by untokenize. In
    TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line='')
    string='' does not match the span length of 2 - 1 = 1. I am inclined to think that the following would be the correct added token, which should untokenize correctly:
    TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 1), line='')

    ast.dump(ast.parse(s)) returns 'Module(body=[])' for both versions of 's', so no help there.

    @iritkatriel
    Member

    I am unable to reproduce this on 3.11:

    >>> tokenize.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
    '#'

    @terryjreedy
    Member

    bpo-44667 ("tokenize.py emits spurious NEWLINE if file ends on a comment without a newline") fixed this in 3.9, 3.10, and 3.11 in August 2021.

    ezio-melotti transferred this issue from another repository on Apr 10, 2022
    Sachaa-Thanasius added a commit to Sachaa-Thanasius/jishaku that referenced this issue on Apr 14, 2024:
    - Fix test break, since `tokenize.tokenize` has buggy behavior that wasn't backported.
      - See python/cpython#79288 and python/cpython#88833.
    - Adjust typing usage so everything from typing is prepended with `typing.`.