Issue 42974: tokenize reports incorrect end col offset and line string when input ends without explicit newline

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87140

classification

Title:	tokenize reports incorrect end col offset and line string when input ends without explicit newline
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.10, Python 3.9, Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	lys.nikolaou, pablogsal, romanows, terry.reedy
Priority:	normal	Keywords:	patch

Created on 2021-01-20 03:13 by romanows, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 24260	open	romanows, 2021-01-20 03:29

Messages (5)
msg385313 - (view)	Author: Brian Romanowski (romanows) *	Date: 2021-01-20 03:13
The tokenize module's tokenizer functions output incorrect (or at least misleading) information when the content being tokenized does not end in a line ending character. This is related to the fix for issue<33899> which added the NEWLINE tokens for this case but did not fill out the whole token tuple correctly. The bug can be seen by running a version of the test in Lib/test/test_tokenize.py: import io, tokenize newline_token_1 = list(tokenize.tokenize(io.BytesIO("x\n".encode('utf-8')).readline))[-2] newline_token_2 = list(tokenize.tokenize(io.BytesIO("x".encode('utf-8')).readline))[-2] print(newline_token_1) print(newline_token_2) # Prints: # TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='x\n') # TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line='') # bad "end" and "line"! Notice that "len(newline_token_2.string) == 0" but "newline_token_2.end[1] - newline_token_2.start[1] == 1". Seems more consistent if the newline_token_2.end == (1, 1). Also, newline_token_2.line should hold the physical line rather than the empty string. This would make it consistent with newline_token_1.line. I'll add a PR shortly with a change so the output from the two cases is: TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='x\n') TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 1), line='x') If this looks reasonable, I can backport it for the other branches. Thanks!
msg385387 - (view)	Author: Brian Romanowski (romanows) *	Date: 2021-01-21 01:52
Shoot, just realized that consistency isn't the important thing here, the most important thing is that the tokenizer module exactly matches the output of the Python tokenizer. It's possible that my changes violate that constraint, I'll take a look at that.
msg385392 - (view)	Author: Brian Romanowski (romanows) *	Date: 2021-01-21 05:19
I took a look at Parser/tokenizer.c. From what I can tell, the tokenizer does fake a newline character when the input buffer does not end with actual newline characters and that the returned NEWLINE token has an effective length of 1 because of this faked newline character. That is, tok->cur - tok->start == 1 when the NEWLINE token is returned. If this part of the C tokenizer is meant to be modeled exactly by the Python tokenize module, then the current code is correct. If there is some wiggle room because tok->start and tok->cur are converted into line numbers and column offsets, then maybe it's acceptable to change them? If not, then the current documentation is misleading because the newline_token_2.end[1] element from my original example is not "... <the> column where the token ends in the source". There is no such column. I'm not sure whether the C tokenizer exposes anything like newline_token_2.string, directly. If so, does it hold the faked newline character or does it hold the empty string like the current tokenize module does? I'm also not sure whether the C tokenizer exposes anything like newline_token_2.line. If it does, I'd be surprised if the faked newline would cause this to somehow become the empty string instead of the actual line content. So I'm guessing that current tokenize module's behavior here is still a real bug? If not, then this is another case that might benefit from some edge-case documentation.
msg385553 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2021-01-23 17:50
The question in the issue for tokenizer and virtual newlines is mimicking the C tokenizer versus consistency between and within token 5-tuples. As a tokenizer user, I would lean toward consistency but it is definitely not my call. https://github.com/python/cpython/pull/24260/files has several tokenize.c warnings like this: '=': conversion from '__int64' to 'int', possible loss of data
msg385916 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2021-01-29 16:19
Do you still need my opinion? I can't say that I have one ATM.

History
Date	User	Action	Args
2022-04-11 14:59:40	admin	set	github: 87140
2021-02-20 01:33:57	gvanrossum	set	nosy: - gvanrossum
2021-01-29 16:19:03	gvanrossum	set	messages: + msg385916
2021-01-23 17:50:19	terry.reedy	set	versions: - Python 3.6, Python 3.7 nosy: + terry.reedy, pablogsal, gvanrossum, lys.nikolaou messages: + msg385553 type: behavior
2021-01-21 05:19:32	romanows	set	messages: + msg385392
2021-01-21 01:52:39	romanows	set	messages: + msg385387
2021-01-20 03:29:44	romanows	set	keywords: + patch stage: patch review pull_requests: + pull_request23085
2021-01-20 03:13:57	romanows	create