Issue 46274: Tokenizer module does not handle backslash characters correctly



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/90432

classification

Title:	Tokenizer module does not handle backslash characters correctly
Type:		Stage:
Components:	Parser	Versions:	Python 3.11, Python 3.10, Python 3.9, Python 3.8, Python 3.7, Python 3.6

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	lys.nikolaou, ucodery
Priority:	normal	Keywords:

Created on 2022-01-05 23:12 by ucodery, last changed 2022-04-11 14:59 by admin.

Messages (1)

msg409811 - (view)

Author: Jeremy (ucodery) *

Date: 2022-01-05 23:12

A source consisting of one or more backslash-escaped newlines followed by one final newline is not tokenized the same way as a source in which those lines have been "manually joined".

The source
```
\
\
\

```
produces the tokens NEWLINE, ENDMARKER when piped to the tokenize module.

Whereas the source
```

```
produces the tokens NL, ENDMARKER.

What I expect is to receive a single NL token from each source. As per the documentation: "Two or more physical lines may be joined into logical lines using backslash characters" ... "A logical line that contains only spaces, tabs, formfeeds and possibly a comment, is ignored (i.e., no NEWLINE token is generated)."

And, because these logical lines are not being ignored, INDENT and DEDENT tokens are also unexpectedly produced when the line starts with spaces or tabs.

The source
```
    \

```
produces the tokens INDENT, NEWLINE, DEDENT, ENDMARKER.

Whereas the source (with spaces)
```
    
```
produces the tokens NL, ENDMARKER.
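
For anyone who wants to reproduce this without piping files to the module, here is a minimal sketch that feeds the four sources above to `tokenize.generate_tokens` and prints the token names. The helper `token_names` and the labels in `SOURCES` are names made up for this example, and the string literals are my transcription of the snippets above. The output for the backslash sources may differ between Python versions, so none is shown here.

```python
import io
import tokenize


def token_names(source):
    """Return the names of the tokens that tokenize yields for `source`."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


# The four sources from this report, transcribed as Python string literals.
# The dictionary keys are illustrative labels only.
SOURCES = {
    "three continuations": "\\\n\\\n\\\n\n",
    "empty line": "\n",
    "indented continuation": "    \\\n\n",
    "spaces only": "    \n",
}

if __name__ == "__main__":
    for label, source in SOURCES.items():
        # Print the label, the raw source, and the token names side by side.
        print(f"{label:22} {source!r:18} {token_names(source)}")
```

Comparing the first row against the second, and the third against the fourth, shows whether a given interpreter treats the backslash-joined line as a blank logical line.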

History
Date	User	Action	Args
2022-04-11 14:59:54	admin	set	github: 90432
2022-01-06 13:57:33	pablogsal	set	title: backslash creating statement out of nothing -> Tokenizer module does not handle backslash characters correctly
2022-01-06 09:46:31	pablogsal	set	nosy: - pablogsal
2022-01-05 23:12:47	ucodery	create