classification
Title: Inconsistencies between PEG parser and traceback SyntaxErrors
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: BTaskaya, gvanrossum, lys.nikolaou, pablogsal, terry.reedy
Priority: normal Keywords:

Created on 2020-05-07 16:05 by lys.nikolaou, last changed 2020-05-18 23:38 by lys.nikolaou. This issue is now closed.

Messages (13)
msg368353 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-05-07 16:05
Since traceback is programmed to match the SyntaxErrors emitted by the old parser, there are some inconsistencies between how it formats SyntaxErrors and how the new parser does it. One example for such behaviour is the following:

New Parser:
╰─ ./python.exe
Python 3.9.0a6+ (heads/master:4638c64295, May  7 2020, 16:47:53)
[Clang 11.0.0 (clang-1100.0.33.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 1 | 2 |
  File "<stdin>", line 1
    x = 1 | 2 |
               ^
SyntaxError: invalid syntax

Old parser and traceback module:
╰─ ./python.exe -X oldparser
Python 3.9.0a6+ (heads/master:4638c64295, May  7 2020, 16:47:53)
[Clang 11.0.0 (clang-1100.0.33.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 1 | 2 |
  File "<stdin>", line 1
    x = 1 | 2 |
              ^
SyntaxError: invalid syntax

We actually think that the new parser does a better job here, since there is nothing wrong with the `|` itself, the error is that a newline follows it and that's where the caret should point to.

Tests that are testing this are currently skipped. I think we should change traceback to match the new parser, wherever we think the new parser does a better job, and "unskip" these tests.
msg368360 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-05-07 17:50
I don't understand why the traceback module is implicated. It just formats the information in the SyntaxError object, same as the builtin printing for syntax errors. The key difference is always in what line/column/text is put in the SyntaxError object by whichever parser is being used.

In the past there was some misunderstanding about whether column numbers are 0-based (the leftmost column is numbered 0) or 1-based (the leftmost column is numbered 1), and at some point we discovered there was an inconsistency -- certain parts of the code put 0-based offsets in the SyntaxError object and other parts put 1-based offsets.

We then decided that the SyntaxError column offset should be 1-based and changed various bits of code to match. It's however possible that we forgot some. It's also still not clearly documented (e.g. the stdlib docs for SyntaxError don't mention it).

What complicates matters further is that the lowest-level C code in the tokenizer definitely uses 0-based offsets, which means that whenever we create a SyntaxError we have to add 1 to the offset. (You can see this happening if you look at various calls to PyErr_SyntaxLocationObject().)
msg368363 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-05-07 18:08
For example:

raise SyntaxError("message", ("<file>", 1, 3, "text"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<file>", line 1
    text
      ^
SyntaxError: message
>>> 

(I can't find docs for the constructor of SyntaxError objects. We should update the docs.)
msg368364 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-05-07 18:14
Guido:
> I don't understand why the traceback module is implicated.

The traceback module is implicated, because it currently does not allow the caret to point to a position after the error line.

In format_exception_only there is the following snippet of code:

if offset is not None:
    caretspace = badline.rstrip('\n')
    offset = min(len(caretspace), offset) - 1
    caretspace = caretspace[:offset].lstrip()
    # non-space whitespace (likes tabs) must be kept for alignment
    caretspace = ((c.isspace() and c or ' ') for c in caretspace)
    yield '    {}^\n'.format(''.join(caretspace))

There are two things here, that put together contradict the way that pegen reports errors. `badline` gets rstripped and then the offset is changed to point to the end of the string, in case it is currently pointing after its end. That basically does not allow SyntaxErrors to be formatted the way they currently are, when using the new parser.

For example:
╰─ ./python.exe
Python 3.9.0a6+ (heads/master:4638c64295, May  7 2020, 16:47:53)
[Clang 11.0.0 (clang-1100.0.33.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> x = 1 | 2 |
  File "<stdin>", line 1
    x = 1 | 2 |
               ^
SyntaxError: invalid syntax

But:
╰─ ./python.exe
Python 3.9.0a6+ (heads/master:4638c64295, May  7 2020, 16:47:53)
[Clang 11.0.0 (clang-1100.0.33.8)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import traceback
>>> try:
...     exec('x = 1 | 2 |')
... except SyntaxError as se:
...     exc = se
...
>>> traceback.print_exception(type(exc), exc, exc.__traceback__)
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<string>", line 1
    x = 1 | 2 |
              ^
SyntaxError: invalid syntax

I might be missing something here, but I think that that's the traceback module's "fault".
msg368368 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-05-07 19:08
Aha! So the traceback module does have a difference.

First, if the offset points inside the text, there's no difference:

>>> raise SyntaxError("message", ("<file>", 1, 4, "text"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<file>", line 1
    text
       ^
SyntaxError: message
traceback.print_exception(SyntaxError, SyntaxError("message", ("<file>", 1, 4, "text")), None)
  File "<file>", line 1
    text
       ^
SyntaxError: message
>>> 

But if the offset points past the text, there's a difference:

>>> raise SyntaxError("message", ("<file>", 1, 5, "text"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<file>", line 1
    text
        ^
SyntaxError: message
>>> traceback.print_exception(SyntaxError, SyntaxError("message", ("<file>", 1, 5, "text")), None)
  File "<file>", line 1
    text
       ^
SyntaxError: message
>>> 

And even:

>>> raise SyntaxError("message", ("<file>", 1, 10, "text"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<file>", line 1
    text
             ^
SyntaxError: message
>>> traceback.print_exception(SyntaxError, SyntaxError("message", ("<file>", 1, 10, "text")), None)
  File "<file>", line 1
    text
       ^
SyntaxError: message
>>> 

I wonder if we need to support the final case? It seems clear to me that if the col_offset points just past the text, we should display a caret just past the text. (I wonder if this might be an off-by-one issue caused by confusion about 0-based vs. 1-based offsets.)

But if the col_offset points way past the text, what should happen? The clipping there seems reasonable.
msg368393 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-05-07 23:53
> But if the col_offset points way past the text, what should happen? The clipping there seems reasonable.

Note that if a file is being parsed and not a string the behaviour is this:
╰─ cat -e t.py
x = 1 | 2 |    $
╰─ ./python.exe t.py
  File "/Users/lysnikolaou/repos/cpython/t.py", line 1
    x = 1 | 2 |
                  ^
SyntaxError: invalid syntax
╰─ ./python.exe -X oldparser t.py
  File "/Users/lysnikolaou/repos/cpython/t.py", line 1
    x = 1 | 2 |
                  ^
SyntaxError: invalid syntax

Would we want to also change this? If not, then I guess that we should as closely as possible match this behaviour with strings as well, which means not clipping.
msg368394 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-05-08 00:03
Hm, but there the old parser seems at fault: it points to the last space rather than beyond it?
msg368398 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-05-08 00:56
> Hm, but there the old parser seems at fault: it points to the last space rather than beyond it?

Both parsers seem to have the same behaviour, when parsing files. I just found out that even the new parser points to the `|` if there is not trailing whitespace:

╰─ cat -e t.py
x = 1 | 2 |$
╰─ ./python.exe t.py
  File "/Users/lysnikolaou/Repositories/cpython/t.py", line 1
    x = 1 | 2 |
              ^
SyntaxError: invalid syntax
╰─ ./python.exe -X oldparser t.py
  File "/Users/lysnikolaou/Repositories/cpython/t.py", line 1
    x = 1 | 2 |
              ^
SyntaxError: invalid syntax

I'll investigate more tomorrow.
msg368472 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-05-08 21:37
I verified that 3.8 puts the caret under '|', and agree that this is bug that should be fixed at least in a new version.  Other such bugs have been fixed in the past.

tk Text widgets have 1-based line numbers and 0-based column numbers. The latter is appropriate since column indexes mostly serve as slice positions rather than as character positions.  IDLE changes the background of the error character to red, so it must be subtracting 1 to get the beginning of the slice to be colored.  There was once a bug about this, so having the CPython behavior be consistent and documented would be good even if not a language feature.  (For some reason I have not yet tracked down, IDLE colors the \n in 3.8 as well as 3.9, but is also correct and consistent for other exceptions where the two versions report the same column.)

On a related note: Since the exceptions raised by compile are not restricted to SyntaxError (or Warning), it would be nice to have them restricted to a definite list that is documented.  Both code.InteractiveInterpreter and IDLE have try-compile-except with (different) tuples extended as uncaught exceptions are reported.
msg368473 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-05-08 21:55
@Terry: Thanks for confirming the bug in the old parser.

Regarding whether column offsets should be 0-based or 1-based: I agree with Tk, but my hand was forced for SyntaxError: we found that it was already using 1-based offsets and we found correct-looking code elsewhere that depended on this, so we had to change the few cases where it was using 0-based offsets, alas.

I think I also looked at how vim and Emacs interpret column numbers, and found that both expect 1-based offsets in compiler error messages of the form "file:line:col: message". IIRC I had to do a deep-dive on this subject when we added column offsets to mypy error messages.

But it's a pain!

@Lysandros:

I wonder if this is the fix we're looking for?

diff --git a/Lib/traceback.py b/Lib/traceback.py
index bf34bbab8a..0e286f60bc 100644
--- a/Lib/traceback.py
+++ b/Lib/traceback.py
@@ -582,7 +582,7 @@ class TracebackException:
             yield '    {}\n'.format(badline.strip())
             if offset is not None:
                 caretspace = badline.rstrip('\n')
-                offset = min(len(caretspace), offset) - 1
+                offset = min(len(caretspace) + 1, offset) - 1
                 caretspace = caretspace[:offset].lstrip()
                 # non-space whitespace (likes tabs) must be kept for alignment
                 caretspace = ((c.isspace() and c or ' ') for c in caretspace)
msg368624 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-05-11 14:34
Guido:
> I wonder if this is the fix we're looking for?

I think this is the most sensible thing to do, yes. I guess, we also have to change both parsers to match this behaviour, right?
msg368625 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-05-11 14:46
I would carefully think about the meaning of that line and see if there are other permutations, e.g. min(Len(blah), offset - 1) ?

It would,be nice to specify what we need here from both parsers and from both code chunks for error printing.
msg369310 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-05-18 23:38
Fixed in GH-20072 and bpo-40612.
History
Date User Action Args
2020-05-18 23:38:02lys.nikolaousetstatus: open -> closed
resolution: fixed
messages: + msg369310

stage: resolved
2020-05-11 14:46:42gvanrossumsetmessages: + msg368625
2020-05-11 14:34:52lys.nikolaousetmessages: + msg368624
2020-05-08 21:55:57gvanrossumsetmessages: + msg368473
2020-05-08 21:37:08terry.reedysetnosy: + terry.reedy
messages: + msg368472
2020-05-08 00:56:28lys.nikolaousetmessages: + msg368398
2020-05-08 00:03:29gvanrossumsetmessages: + msg368394
2020-05-07 23:53:42lys.nikolaousetmessages: + msg368393
2020-05-07 19:08:40gvanrossumsetmessages: + msg368368
2020-05-07 18:14:12lys.nikolaousetmessages: + msg368364
2020-05-07 18:08:07gvanrossumsetmessages: + msg368363
2020-05-07 17:50:45gvanrossumsetmessages: + msg368360
2020-05-07 16:14:06BTaskayasetnosy: + BTaskaya
2020-05-07 16:08:05lys.nikolaousettitle: Inconsistencies between pegen and traceback SyntaxErrors -> Inconsistencies between PEG parser and traceback SyntaxErrors
2020-05-07 16:05:21lys.nikolaoucreate