classification
Title: Edge case in compiler when error displaying with non-utf8 lines
Type: Stage: resolved
Components: Parser Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ammar2, lys.nikolaou, miss-islington, pablogsal
Priority: normal Keywords: patch

Created on 2021-06-08 17:50 by ammar2, last changed 2021-06-09 00:29 by pablogsal. This issue is now closed.

Pull Requests
URL Status Linked
PR 26611 merged pablogsal, 2021-06-08 19:03
PR 26616 merged miss-islington, 2021-06-08 23:54
Messages (6)
msg395347 - (view) Author: Ammar Askar (ammar2) * (Python committer) Date: 2021-06-08 17:50
The AST currently stores column offsets as byte offsets. When displaying errors, however, these byte offsets must be converted into character offsets so that the carets line up with the characters on the printed line. This is done by the function `byte_offset_to_character_offset` (https://github.com/python/cpython/blob/fdc7e52f5f1853e350407c472ae031339ac7f60c/Parser/pegen.c#L142-L161), which assumes that the line is UTF-8 encoded.
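
For illustration, the conversion described above can be sketched in Python. This is only a rough analogue, not the C code in pegen.c: the helper name `utf8_byte_to_char_offset` is hypothetical, and the sketch assumes the line really is UTF-8.

```python
def utf8_byte_to_char_offset(line: bytes, byte_offset: int) -> int:
    """Return the character offset matching byte_offset, assuming UTF-8 input.

    Hypothetical helper for illustration only.
    """
    # Decode only the bytes before the offset; undecodable bytes become U+FFFD.
    prefix = line[:byte_offset].decode("utf-8", errors="replace")
    return len(prefix)


text = "'¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1))"
raw = text.encode("utf-8")

byte_col = raw.find(b"'Hi'")  # offset counted in bytes
char_col = utf8_byte_to_char_offset(raw, byte_col)  # offset counted in characters
print(byte_col, char_col)
```

Each "¢" takes two bytes in UTF-8, so the byte offset runs ahead of the character offset by one per "¢" that precedes it on the line.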

However, consider a file like this:

  '¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError

This prints

  File "test-normal.py", line 1
    '¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError
                          ^^^^^^^^^^^^^^^^^^^^^^
  SyntaxError: Generator expression must be parenthesized

as expected.


However, if we add a custom source encoding declaration:

  # -*- coding: cp437 -*-
  '¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError

it ends up printing out

  File "C:\Users\ammar\junk\test-utf16.py", line 2
    '¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError
                                      ^^^^^^^^^^^^^^^^^^^^^^
  SyntaxError: Generator expression must be parenthesized

where the carets/offsets are misaligned with the actual characters. This is because the string "┬ó" has a display width of 2 characters and encodes to 2 bytes in cp437, but those same 2 bytes, when interpreted as utf-8, are the single character "¢" with a display width of 1.
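
The two readings of the same bytes can be checked with a short Python snippet. It uses only the standard `utf-8` and `cp437` codecs; the variable names are illustrative.

```python
raw = "¢".encode("utf-8")  # the two bytes 0xC2 0xA2

as_utf8 = raw.decode("utf-8")    # one character: "¢"
as_cp437 = raw.decode("cp437")   # two characters: "┬ó"

print(len(raw), len(as_utf8), len(as_cp437))
print(as_utf8, as_cp437)
```

A caret position computed from one reading is therefore off by one column per such byte pair when the line is displayed using the other reading.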

Note that this edge case is relatively hard to trigger. Ordinarily, the call to PyErr_ProgramTextObject fails because it tries to decode the line as utf-8: https://github.com/python/cpython/blob/ae3c66acb89a6104fcd0eea760f80a0287327cc4/Python/errors.c#L1693-L1696. The error handling logic then falls back to the tokenizer's internal buffer, which holds a proper utf-8 string.
So this bug requires the input to be valid as both utf-8 and the source encoding.
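
A minimal sketch of that condition, in Python. The helper names `decodes_as` and `may_misalign` are hypothetical, and the extra check that the two decodings differ is an assumption added here (identical decodings, such as pure ASCII, cannot shift the carets); it is not taken from the CPython source.

```python
def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if raw is valid in the given encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def may_misalign(raw: bytes, source_encoding: str) -> bool:
    """Hypothetical check: the line is valid as both utf-8 and the declared
    source encoding, yet the two decodings disagree."""
    if not (decodes_as(raw, "utf-8") and decodes_as(raw, source_encoding)):
        return False
    return raw.decode("utf-8") != raw.decode(source_encoding)


print(may_misalign("'¢'".encode("utf-8"), "cp437"))  # valid in both, decodings differ
print(may_misalign("'¢'".encode("cp437"), "cp437"))  # not valid utf-8
print(may_misalign(b"'Hi'", "cp437"))                # pure ASCII, decodings agree
```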

(Discovered while implementing PEP 657 https://github.com/colnotab/cpython/issues/10)
msg395350 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-08 18:31
Lysandros, could you take a look?
msg395351 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-08 18:32
This also affects older versions:

python3.8 lel.py
  File "lel.py", line 3

                  ^
SyntaxError: Generator expression must be parenthesized
msg395354 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-08 19:04
I think the simplest solution is PR 26611. 

Ammar, can you check if that works for you?
msg395369 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-08 23:54
New changeset 9fd21f649d66dcb10108ee395fd68ed32c8239cd by Pablo Galindo in branch 'main':
bpo-44349: Fix edge case when displaying text from files with encoding in syntax errors (GH-26611)
https://github.com/python/cpython/commit/9fd21f649d66dcb10108ee395fd68ed32c8239cd
msg395370 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-09 00:29
New changeset c0496093e54edb78d2bd09b083b73e1e5b9e7242 by Miss Islington (bot) in branch '3.10':
bpo-44349: Fix edge case when displaying text from files with encoding in syntax errors (GH-26611) (GH-26616)
https://github.com/python/cpython/commit/c0496093e54edb78d2bd09b083b73e1e5b9e7242
History
Date User Action Args
2021-06-09 00:29:32 pablogsal set messages: + msg395370
2021-06-08 23:55:24 pablogsal set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2021-06-08 23:54:37 miss-islington set nosy: + miss-islington
pull_requests: + pull_request25200
2021-06-08 23:54:36 pablogsal set messages: + msg395369
2021-06-08 19:04:33 pablogsal set messages: + msg395354
2021-06-08 19:03:32 pablogsal set keywords: + patch
stage: patch review
pull_requests: + pull_request25194
2021-06-08 18:38:29 ammar2 set title: Edge case in when error displaying with non-utf8 lines -> Edge case in compiler when error displaying with non-utf8 lines
2021-06-08 18:32:27 pablogsal set messages: + msg395351
title: Edge case in pegen's error displaying with non-utf8 lines -> Edge case in when error displaying with non-utf8 lines
2021-06-08 18:31:28 pablogsal set messages: + msg395350
2021-06-08 17:50:20 ammar2 create