Message 385313 - Python tracker



Author	romanows
Recipients	romanows
Date	2021-01-20.03:13:57
Message-id	<1611112437.84.0.130779999528.issue42974@roundup.psfhosted.org>
In-reply-to

Content

The tokenize module's tokenizer functions output incorrect (or at least misleading) information when the content being tokenized does not end in a line-ending character. This is related to the fix for issue 33899, which added the NEWLINE token for this case but did not fill out the whole token tuple correctly.

The bug can be seen by running a version of the test in Lib/test/test_tokenize.py:

import io, tokenize

newline_token_1 = list(tokenize.tokenize(io.BytesIO("x\n".encode('utf-8')).readline))[-2]
newline_token_2 = list(tokenize.tokenize(io.BytesIO("x".encode('utf-8')).readline))[-2]

print(newline_token_1)
print(newline_token_2)

# Prints:
# TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='x\n')
# TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line='')  # bad "end" and "line"!

Notice that "len(newline_token_2.string) == 0" but "newline_token_2.end[1] - newline_token_2.start[1] == 1", so the reported column span is one character wider than the token's string. It would be more consistent for newline_token_2.end to be (1, 1).
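The span check described above can be written as a small helper. This is only a sketch: the function name column_span_matches_string is made up for illustration, and the two tokens are built by hand to mirror the NEWLINE outputs quoted in this message rather than produced by a particular Python version.

```python
import tokenize


def column_span_matches_string(tok):
    """Return True if a single-line token's column span equals len(tok.string).

    Hypothetical helper for illustration; it is not part of the tokenize module.
    """
    (start_row, start_col), (end_row, end_col) = tok.start, tok.end
    if start_row != end_row:
        raise ValueError("only single-line tokens are supported")
    return end_col - start_col == len(tok.string)


# Hand-built tokens mirroring the implicit NEWLINE token quoted above:
# first as currently reported, then as proposed.
reported = tokenize.TokenInfo(tokenize.NEWLINE, '', (1, 1), (1, 2), '')
proposed = tokenize.TokenInfo(tokenize.NEWLINE, '', (1, 1), (1, 1), 'x')

print(column_span_matches_string(reported))  # False
print(column_span_matches_string(proposed))  # True
```

The helper is meant for tokens such as NAME or NEWLINE, whose string is taken directly from the source line.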

Also, newline_token_2.line should hold the physical line rather than the empty string.  This would make it consistent with newline_token_1.line.
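One way to look for this kind of mismatch is to compare the "line" attribute of all tokens that start on the same row. The following is a rough sketch under stated assumptions: rows_with_conflicting_lines is a hypothetical name, the tokens are constructed by hand from the values quoted in this message, and the check assumes single-line tokens only (multi-line strings are out of scope here).

```python
import tokenize


def rows_with_conflicting_lines(tokens):
    """Return the sorted row numbers on which tokens disagree about .line.

    Hypothetical helper for illustration. ENCODING and ENDMARKER tokens are
    skipped because they do not correspond to a physical source line.
    """
    first_line_seen = {}
    conflicts = set()
    for tok in tokens:
        if tok.type in (tokenize.ENCODING, tokenize.ENDMARKER):
            continue
        row = tok.start[0]
        if row in first_line_seen and first_line_seen[row] != tok.line:
            conflicts.add(row)
        first_line_seen.setdefault(row, tok.line)
    return sorted(conflicts)


# Hand-built tokens for the source "x" with no trailing newline.
name = tokenize.TokenInfo(tokenize.NAME, 'x', (1, 0), (1, 1), 'x')
reported = tokenize.TokenInfo(tokenize.NEWLINE, '', (1, 1), (1, 2), '')
proposed = tokenize.TokenInfo(tokenize.NEWLINE, '', (1, 1), (1, 1), 'x')

print(rows_with_conflicting_lines([name, reported]))  # [1]
print(rows_with_conflicting_lines([name, proposed]))  # []
```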

I'll add a PR shortly with a change so the output from the two cases is:

TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='x\n')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 1), line='x')

If this looks reasonable, I can backport it to the other branches. Thanks!
