Issue 40546: Inconsistencies between PEG parser and traceback SyntaxErrors

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/84726

classification

Title:	Inconsistencies between PEG parser and traceback SyntaxErrors
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.9

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	BTaskaya, gvanrossum, lys.nikolaou, pablogsal, terry.reedy
Priority:	normal	Keywords:

Created on 2020-05-07 16:05 by lys.nikolaou, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (13)
msg368353 - (view)	Author: Lysandros Nikolaou (lys.nikolaou) *	Date: 2020-05-07 16:05
Since traceback is programmed to match the SyntaxErrors emitted by the old parser, there are some inconsistencies between how it formats SyntaxErrors and how the new parser does it. One example for such behaviour is the following: New Parser: ╰─ ./python.exe Python 3.9.0a6+ (heads/master:4638c64295, May 7 2020, 16:47:53) [Clang 11.0.0 (clang-1100.0.33.8)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> x = 1 \| 2 \| File "<stdin>", line 1 x = 1 \| 2 \| ^ SyntaxError: invalid syntax Old parser and traceback module: ╰─ ./python.exe -X oldparser Python 3.9.0a6+ (heads/master:4638c64295, May 7 2020, 16:47:53) [Clang 11.0.0 (clang-1100.0.33.8)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> x = 1 \| 2 \| File "<stdin>", line 1 x = 1 \| 2 \| ^ SyntaxError: invalid syntax We actually think that the new parser does a better job here, since there is nothing wrong with the `\|` itself, the error is that a newline follows it and that's where the caret should point to. Tests that are testing this are currently skipped. I think we should change traceback to match the new parser, wherever we think the new parser does a better job, and "unskip" these tests.
msg368360 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-05-07 17:50
I don't understand why the traceback module is implicated. It just formats the information in the SyntaxError object, same as the builtin printing for syntax errors. The key difference is always in what line/column/text is put in the SyntaxError object by whichever parser is being used. In the past there was some misunderstanding about whether column numbers are 0-based (the leftmost column is numbered 0) or 1-based (the leftmost column is numbered 1), and at some point we discovered there was an inconsistency -- certain parts of the code put 0-based offsets in the SyntaxError object and other parts put 1-based offsets. We then decided that the SyntaxError column offset should be 1-based and changed various bits of code to match. It's however possible that we forgot some. It's also still not clearly documented (e.g. the stdlib docs for SyntaxError don't mention it). What complicates matters further is that the lowest-level C code in the tokenizer definitely uses 0-based offsets, which means that whenever we create a SyntaxError we have to add 1 to the offset. (You can see this happening if you look at various calls to PyErr_SyntaxLocationObject().)
msg368363 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-05-07 18:08
For example: raise SyntaxError("message", ("<file>", 1, 3, "text")) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<file>", line 1 text ^ SyntaxError: message >>> (I can't find docs for the constructor of SyntaxError objects. We should update the docs.)
msg368364 - (view)	Author: Lysandros Nikolaou (lys.nikolaou) *	Date: 2020-05-07 18:14
Guido: > I don't understand why the traceback module is implicated. The traceback module is implicated, because it currently does not allow the caret to point to a position after the error line. In format_exception_only there is the following snippet of code: if offset is not None: caretspace = badline.rstrip('\n') offset = min(len(caretspace), offset) - 1 caretspace = caretspace[:offset].lstrip() # non-space whitespace (likes tabs) must be kept for alignment caretspace = ((c.isspace() and c or ' ') for c in caretspace) yield ' {}^\n'.format(''.join(caretspace)) There are two things here, that put together contradict the way that pegen reports errors. `badline` gets rstripped and then the offset is changed to point to the end of the string, in case it is currently pointing after its end. That basically does not allow SyntaxErrors to be formatted the way they currently are, when using the new parser. For example: ╰─ ./python.exe Python 3.9.0a6+ (heads/master:4638c64295, May 7 2020, 16:47:53) [Clang 11.0.0 (clang-1100.0.33.8)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> x = 1 \| 2 \| File "<stdin>", line 1 x = 1 \| 2 \| ^ SyntaxError: invalid syntax But: ╰─ ./python.exe Python 3.9.0a6+ (heads/master:4638c64295, May 7 2020, 16:47:53) [Clang 11.0.0 (clang-1100.0.33.8)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import traceback >>> try: ... exec('x = 1 \| 2 \|') ... except SyntaxError as se: ... exc = se ... >>> traceback.print_exception(type(exc), exc, exc.__traceback__) Traceback (most recent call last): File "<stdin>", line 2, in <module> File "<string>", line 1 x = 1 \| 2 \| ^ SyntaxError: invalid syntax I might be missing something here, but I think that that's the traceback module's "fault".
msg368368 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-05-07 19:08
Aha! So the traceback module does have a difference. First, if the offset points inside the text, there's no difference: >>> raise SyntaxError("message", ("<file>", 1, 4, "text")) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<file>", line 1 text ^ SyntaxError: message traceback.print_exception(SyntaxError, SyntaxError("message", ("<file>", 1, 4, "text")), None) File "<file>", line 1 text ^ SyntaxError: message >>> But if the offset points past the text, there's a difference: >>> raise SyntaxError("message", ("<file>", 1, 5, "text")) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<file>", line 1 text ^ SyntaxError: message >>> traceback.print_exception(SyntaxError, SyntaxError("message", ("<file>", 1, 5, "text")), None) File "<file>", line 1 text ^ SyntaxError: message >>> And even: >>> raise SyntaxError("message", ("<file>", 1, 10, "text")) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<file>", line 1 text ^ SyntaxError: message >>> traceback.print_exception(SyntaxError, SyntaxError("message", ("<file>", 1, 10, "text")), None) File "<file>", line 1 text ^ SyntaxError: message >>> I wonder if we need to support the final case? It seems clear to me that if the col_offset points just past the text, we should display a caret just past the text. (I wonder if this might be an off-by-one issue caused by confusion about 0-based vs. 1-based offsets.) But if the col_offset points way past the text, what should happen? The clipping there seems reasonable.
msg368393 - (view)	Author: Lysandros Nikolaou (lys.nikolaou) *	Date: 2020-05-07 23:53
> But if the col_offset points way past the text, what should happen? The clipping there seems reasonable. Note that if a file is being parsed and not a string the behaviour is this: ╰─ cat -e t.py x = 1 \| 2 \| $ ╰─ ./python.exe t.py File "/Users/lysnikolaou/repos/cpython/t.py", line 1 x = 1 \| 2 \| ^ SyntaxError: invalid syntax ╰─ ./python.exe -X oldparser t.py File "/Users/lysnikolaou/repos/cpython/t.py", line 1 x = 1 \| 2 \| ^ SyntaxError: invalid syntax Would we want to also change this? If not, then I guess that we should as closely as possible match this behaviour with strings as well, which means not clipping.
msg368394 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-05-08 00:03
Hm, but there the old parser seems at fault: it points to the last space rather than beyond it?
msg368398 - (view)	Author: Lysandros Nikolaou (lys.nikolaou) *	Date: 2020-05-08 00:56
> Hm, but there the old parser seems at fault: it points to the last space rather than beyond it? Both parsers seem to have the same behaviour, when parsing files. I just found out that even the new parser points to the `\|` if there is not trailing whitespace: ╰─ cat -e t.py x = 1 \| 2 \|$ ╰─ ./python.exe t.py File "/Users/lysnikolaou/Repositories/cpython/t.py", line 1 x = 1 \| 2 \| ^ SyntaxError: invalid syntax ╰─ ./python.exe -X oldparser t.py File "/Users/lysnikolaou/Repositories/cpython/t.py", line 1 x = 1 \| 2 \| ^ SyntaxError: invalid syntax I'll investigate more tomorrow.
msg368472 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2020-05-08 21:37
I verified that 3.8 puts the caret under '\|', and agree that this is bug that should be fixed at least in a new version. Other such bugs have been fixed in the past. tk Text widgets have 1-based line numbers and 0-based column numbers. The latter is appropriate since column indexes mostly serve as slice positions rather than as character positions. IDLE changes the background of the error character to red, so it must be subtracting 1 to get the beginning of the slice to be colored. There was once a bug about this, so having the CPython behavior be consistent and documented would be good even if not a language feature. (For some reason I have not yet tracked down, IDLE colors the \n in 3.8 as well as 3.9, but is also correct and consistent for other exceptions where the two versions report the same column.) On a related note: Since the exceptions raised by compile are not restricted to SyntaxError (or Warning), it would be nice to have them restricted to a definite list that is documented. Both code.InteractiveInterpreter and IDLE have try-compile-except with (different) tuples extended as uncaught exceptions are reported.
msg368473 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-05-08 21:55
@Terry: Thanks for confirming the bug in the old parser. Regarding whether column offsets should be 0-based or 1-based: I agree with Tk, but my hand was forced for SyntaxError: we found that it was already using 1-based offsets and we found correct-looking code elsewhere that depended on this, so we had to change the few cases where it was using 0-based offsets, alas. I think I also looked at how vim and Emacs interpret column numbers, and found that both expect 1-based offsets in compiler error messages of the form "file:line:col: message". IIRC I had to do a deep-dive on this subject when we added column offsets to mypy error messages. But it's a pain! @Lysandros: I wonder if this is the fix we're looking for? diff --git a/Lib/traceback.py b/Lib/traceback.py index bf34bbab8a..0e286f60bc 100644 --- a/Lib/traceback.py +++ b/Lib/traceback.py @@ -582,7 +582,7 @@ class TracebackException: yield ' {}\n'.format(badline.strip()) if offset is not None: caretspace = badline.rstrip('\n') - offset = min(len(caretspace), offset) - 1 + offset = min(len(caretspace) + 1, offset) - 1 caretspace = caretspace[:offset].lstrip() # non-space whitespace (likes tabs) must be kept for alignment caretspace = ((c.isspace() and c or ' ') for c in caretspace)
msg368624 - (view)	Author: Lysandros Nikolaou (lys.nikolaou) *	Date: 2020-05-11 14:34
Guido: > I wonder if this is the fix we're looking for? I think this is the most sensible thing to do, yes. I guess, we also have to change both parsers to match this behaviour, right?
msg368625 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-05-11 14:46
I would carefully think about the meaning of that line and see if there are other permutations, e.g. min(Len(blah), offset - 1) ? It would,be nice to specify what we need here from both parsers and from both code chunks for error printing.
msg369310 - (view)	Author: Lysandros Nikolaou (lys.nikolaou) *	Date: 2020-05-18 23:38
Fixed in GH-20072 and bpo-40612.

History
Date	User	Action	Args
2022-04-11 14:59:30	admin	set	github: 84726
2020-05-18 23:38:02	lys.nikolaou	set	status: open -> closed resolution: fixed messages: + msg369310 stage: resolved
2020-05-11 14:46:42	gvanrossum	set	messages: + msg368625
2020-05-11 14:34:52	lys.nikolaou	set	messages: + msg368624
2020-05-08 21:55:57	gvanrossum	set	messages: + msg368473
2020-05-08 21:37:08	terry.reedy	set	nosy: + terry.reedy messages: + msg368472
2020-05-08 00:56:28	lys.nikolaou	set	messages: + msg368398
2020-05-08 00:03:29	gvanrossum	set	messages: + msg368394
2020-05-07 23:53:42	lys.nikolaou	set	messages: + msg368393
2020-05-07 19:08:40	gvanrossum	set	messages: + msg368368
2020-05-07 18:14:12	lys.nikolaou	set	messages: + msg368364
2020-05-07 18:08:07	gvanrossum	set	messages: + msg368363
2020-05-07 17:50:45	gvanrossum	set	messages: + msg368360
2020-05-07 16:14:06	BTaskaya	set	nosy: + BTaskaya
2020-05-07 16:08:05	lys.nikolaou	set	title: Inconsistencies between pegen and traceback SyntaxErrors -> Inconsistencies between PEG parser and traceback SyntaxErrors
2020-05-07 16:05:21	lys.nikolaou	create