classification
Title: Improve error reporting for invalid character in source code
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: serhiy.storchaka
Priority: normal Keywords: patch

Created on 2020-05-11 10:49 by serhiy.storchaka, last changed 2020-05-12 09:42 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 20033 merged serhiy.storchaka, 2020-05-11 10:53
Messages (2)
msg368622 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-05-11 10:49
Currently you get a SyntaxError with the message "invalid character in identifier" in two cases:

1. The source code contains some non-ASCII non-identifier character. This usually happens when you copy code from a web page or PDF file that was "improved" by some enhancer which replaces spaces with non-breaking spaces, the ASCII minus with a dash or Unicode minus, and ASCII quotes with fancy Unicode quotes. Such characters do not look like part of an identifier at all. The error message also does not say which character is invalid, and it is hard to find the culprit because these characters look so similar to the correct ones (especially in some monospace fonts).

See https://mail.python.org/archives/list/python-ideas@python.org/thread/ILMNJ46EAL4ENYK7LLDLGIMYQKZAMMWU/ for discussion.

2. The other case is very special -- the source code contains a declaration for the UTF-8 encoding followed by non-UTF-8 byte sequences. This rarely happens in the real world.
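To illustrate why the culprit in the first case is hard to spot by eye, here is a minimal sketch of a scanner that lists non-ASCII characters which cannot be part of an identifier. It is not part of the proposed PR; the function name `find_suspicious_chars` is hypothetical, and the scan is naive in that it also flags such characters inside string literals and comments, where they are perfectly legal.

```python
import unicodedata


def find_suspicious_chars(line):
    """Return (column, char, code point, Unicode name) for each suspect character.

    Hypothetical helper for illustration only: it does not tokenize the line,
    so characters inside string literals and comments are reported as well.
    """
    hits = []
    for col, ch in enumerate(line):
        if ord(ch) < 128:
            continue  # plain ASCII is never the problem described here
        # A non-ASCII character is acceptable in an identifier if it can
        # start one, or continue one (checked by prefixing an ASCII letter).
        if ch.isidentifier() or ("a" + ch).isidentifier():
            continue
        name = unicodedata.name(ch, "<unnamed>")
        hits.append((col, ch, "U+%04X" % ord(ch), name))
    return hits


# The em dash is written as an escape so the example survives copy and paste.
for hit in find_suspicious_chars("print(123\u201445)"):
    print(hit)
```

Columns are 0-based string indices here, unlike the 1-based offsets reported by SyntaxError.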

The proposed PR improves errors for these cases.

>>> print(123—45)
  File "<stdin>", line 1
    print(123—45)
             ^
SyntaxError: invalid character '—' (U+2014)

* The error message no longer contains misleading "in identifier".

* The error message contains the invalid character, both literally and as its hexadecimal code point.

* The caret points at the invalid character. Previously it pointed at the last non-ASCII or non-alphabetic character following the invalid character (5 in the above example).

* For the special case of a non-decodable UTF-8 sequence the syntax error message is more informative: "(unicode error) 'utf-8' codec can't decode byte 0xff ...", although this case needs further improvement.
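The points above can be checked programmatically. The following is a rough sketch that compiles a snippet and returns the message, line number and column stored on the resulting SyntaxError. The helper name `describe_syntax_error` and the file name `<example>` are made up for this illustration, and the exact message text and offset depend on the interpreter version.

```python
def describe_syntax_error(source):
    """Compile `source` and return (msg, lineno, offset) of the SyntaxError.

    Returns None if the source compiles cleanly. Hypothetical helper,
    not part of CPython.
    """
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as exc:
        # `offset` is the 1-based column that the caret is drawn under.
        return exc.msg, exc.lineno, exc.offset
    return None


# Same input as the interactive example, with the em dash written as an escape.
print(describe_syntax_error("print(123\u201445)\n"))
```

On an interpreter that includes this change, the message is expected to name the character and its code point instead of saying "in identifier".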
msg368713 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-05-12 09:42
New changeset 74ea6b5a7501fb393cd567fb21998d0bfeeb267c by Serhiy Storchaka in branch 'master':
bpo-40593: Improve syntax errors for invalid characters in source code. (GH-20033)
https://github.com/python/cpython/commit/74ea6b5a7501fb393cd567fb21998d0bfeeb267c
History
Date User Action Args
2020-05-12 09:42:56 serhiy.storchaka set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2020-05-12 09:42:32 serhiy.storchaka set messages: + msg368713
2020-05-11 10:53:20 serhiy.storchaka set keywords: + patch
stage: patch review
pull_requests: + pull_request19342
2020-05-11 10:49:33 serhiy.storchaka create