Message 395347 - Python tracker


Author	ammar2
Recipients	ammar2, lys.nikolaou, pablogsal
Date	2021-06-08.17:50:20
Message-id	<1623174620.92.0.143887592684.issue44349@roundup.psfhosted.org>

Content
The AST currently stores column offsets as byte offsets. When displaying errors, however, these byte offsets must be converted into character offsets so that the carets line up with the characters on the line when it is printed. This is done by the function `byte_offset_to_character_offset` (https://github.com/python/cpython/blob/fdc7e52f5f1853e350407c472ae031339ac7f60c/Parser/pegen.c#L142-L161), which assumes that the line is UTF-8 encoded.
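
To make the conversion concrete, here is a rough Python analogue of the idea. It is only a sketch, not the C code linked above: the name `utf8_byte_to_char_offset` is hypothetical, and the lenient "replace" error handling is an assumption on my part.

```python
def utf8_byte_to_char_offset(line: bytes, byte_offset: int) -> int:
    """Hypothetical helper: count the characters that precede a byte offset.

    Assumes `line` is UTF-8 encoded, which is the assumption this report
    is about.
    """
    # Decode only the bytes before the offset; each decoded character
    # occupies one column when the line is printed.
    prefix = line[:byte_offset].decode("utf-8", errors="replace")
    return len(prefix)


# "¢" is 2 bytes in UTF-8, so byte and character offsets diverge after it.
line = "'¢¢' + x".encode("utf-8")
plus_col = utf8_byte_to_char_offset(line, 7)  # byte 7 is the "+" sign
```

When the bytes really are UTF-8, the result is the column where the caret belongs. The rest of this report is about what happens when they are not.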

However, consider a file like this:

  '┬ó┬ó┬ó┬ó┬ó┬ó' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError

This prints

  File "test-normal.py", line 1
    '┬ó┬ó┬ó┬ó┬ó┬ó' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError
                          ^^^^^^^^^^^^^^^^^^^^^^
  SyntaxError: Generator expression must be parenthesized

as expected.


However, if we add a custom source encoding line:

  # -*- coding: cp437 -*-
  '┬ó┬ó┬ó┬ó┬ó┬ó' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError

it ends up printing out

  File "C:\Users\ammar\junk\test-utf16.py", line 2
    '¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError
                                      ^^^^^^^^^^^^^^^^^^^^^^
  SyntaxError: Generator expression must be parenthesized

where the carets are misaligned with the actual characters. This is because the string "┬ó" has a display width of 2 characters and encodes to 2 bytes in cp437, but those same 2 bytes, when interpreted as UTF-8, are the single character "¢" with a display width of 1.
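
The mismatch can be reproduced in plain Python. This is a sketch of my reading of the problem, assuming the column offset is a UTF-8 byte offset into the correctly decoded source while the displayed line is the raw file bytes decoded as UTF-8. All variable names are illustrative.

```python
source_text = "'┬ó┬ó┬ó┬ó┬ó┬ó' + f(4, 'Hi' for x in range(1))"

# Bytes on disk when the file is saved as cp437.
on_disk = source_text.encode("cp437")

# Each "┬ó" pair is stored as the two bytes C2 A2, which is also the
# UTF-8 encoding of "¢", so the raw bytes decode cleanly as UTF-8.
pair_bytes = "┬ó".encode("cp437")
misread = on_disk.decode("utf-8")

# Byte offset of the generator expression in the UTF-8 form of the
# correctly decoded text (assumed to be what the AST records).
target = source_text.index("'Hi'")
byte_offset = len(source_text[:target].encode("utf-8"))

# Correct conversion: measure against the UTF-8 form of the real text.
correct_col = len(source_text.encode("utf-8")[:byte_offset].decode("utf-8"))

# Misaligned conversion: apply the same byte offset to the raw on-disk
# bytes, treating them as UTF-8.
shifted_col = len(on_disk[:byte_offset].decode("utf-8", errors="replace"))

# Where 'Hi' actually starts in the line that gets displayed.
displayed_col = misread.index("'Hi'")
```

Under these assumptions the caret column (`shifted_col`) lands well to the right of where `'Hi'` sits in the displayed line (`displayed_col`).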

Note that this edge case is relatively hard to trigger. Ordinarily, the call to PyErr_ProgramTextObject fails because it tries to decode the line as UTF-8: https://github.com/python/cpython/blob/ae3c66acb89a6104fcd0eea760f80a0287327cc4/Python/errors.c#L1693-L1696. The error handling logic then uses the tokenizer's internal buffer, which holds a proper UTF-8 string.
So this bug requires the input to be valid as both UTF-8 and the source encoding.
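
A quick way to check whether a given line meets that condition is sketched below. The helper name `decodes_both_ways` is hypothetical, and the extra requirement that the two decodings differ is my own assumption about when the offsets can actually diverge.

```python
def decodes_both_ways(raw_line: bytes, source_encoding: str) -> bool:
    """Hypothetical check: is the line valid as UTF-8 and as the declared
    source encoding, yet decodes to different text under each?"""
    try:
        as_utf8 = raw_line.decode("utf-8")
        as_declared = raw_line.decode(source_encoding)
    except UnicodeDecodeError:
        # Not valid in one of the two encodings, so (per the note above)
        # the UTF-8 decode of the raw line fails and the tokenizer's
        # buffer is used instead.
        return False
    # Pure-ASCII lines decode identically and cannot be misaligned.
    return as_utf8 != as_declared


ambiguous = decodes_both_ways("┬ó".encode("cp437"), "cp437")
not_utf8 = decodes_both_ways("é".encode("cp437"), "cp437")
plain_ascii = decodes_both_ways(b"x = 1", "cp437")
```

Here only the first line qualifies: the "é" line is a lone 0x82 byte, which is not valid UTF-8, and the ASCII line decodes identically under both encodings.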

(Discovered while implementing PEP 657 https://github.com/colnotab/cpython/issues/10)

History
Date	User	Action	Args
2021-06-08 17:50:20	ammar2	set	recipients: + ammar2, lys.nikolaou, pablogsal
2021-06-08 17:50:20	ammar2	set	messageid: <1623174620.92.0.143887592684.issue44349@roundup.psfhosted.org>
2021-06-08 17:50:20	ammar2	link	issue44349 messages
2021-06-08 17:50:20	ammar2	create