
Author romanows
Recipients romanows
Date 2021-01-21.05:19:32
I took a look at Parser/tokenizer.c.  From what I can tell, the tokenizer does fake a newline character when the input buffer does not end with an actual newline, and the returned NEWLINE token has an effective length of 1 because of this faked character.  That is, tok->cur - tok->start == 1 when the NEWLINE token is returned.

If this part of the C tokenizer is meant to be modeled exactly by the Python tokenize module, then the current code is correct.  If there is some wiggle room because tok->start and tok->cur are converted into line numbers and column offsets, then maybe it's acceptable to change them?  If not, then the current documentation is misleading because the newline_token_2.end[1] element from my original example is not "... <the> column where the token ends in the source".  There is no such column.

I'm not sure whether the C tokenizer directly exposes anything like newline_token_2.string.  If it does, does it hold the faked newline character, or the empty string as the current tokenize module does?

I'm also not sure whether the C tokenizer exposes anything like newline_token_2.line.  If it does, I'd be surprised if the faked newline somehow caused it to become the empty string instead of the actual line content.  So I'm guessing that the current tokenize module's behavior here is still a real bug?  If not, then this is another case that might benefit from some edge-case documentation.
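
For anyone who wants to compare the two cases side by side, here is a minimal sketch using only the tokenize module. It tokenizes the same one-line source with and without a trailing newline and prints the NEWLINE token from each. The helper first_newline_token and the name newline_token_1 are made up for this illustration; newline_token_2 stands in for the token from the original example, on the assumption that it came from a source with no trailing newline. The printed string, end, and line fields may differ between Python versions, so no expected output is shown.

```python
import io
import tokenize


def first_newline_token(source):
    """Return the first NEWLINE token that tokenize yields for source, or None.

    Hypothetical helper for this illustration, not part of the tokenize API.
    """
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NEWLINE:
            return tok
    return None


# Same logical line, once ending in a real newline and once without one.
newline_token_1 = first_newline_token("a\n")
newline_token_2 = first_newline_token("a")

# Compare the string, start, end, and line fields of the two tokens.
for label, tok in (("with newline", newline_token_1),
                   ("without newline", newline_token_2)):
    print(f"{label}: {tok!r}")
```

The fields to look at are newline_token_2.string, newline_token_2.end[1], and newline_token_2.line, since those are the ones in question above.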
Date                 User      Action  Args
2021-01-21 05:19:32  romanows  set     recipients: + romanows
2021-01-21 05:19:32  romanows  set     messageid: <>
2021-01-21 05:19:32  romanows  link    issue42974 messages
2021-01-21 05:19:32  romanows  create