Improve error reporting for invalid character in source code #84773

Closed
serhiy-storchaka opened this issue May 11, 2020 · 2 comments
Labels
3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@serhiy-storchaka
Member

BPO 40593
Nosy @serhiy-storchaka
PRs
  • bpo-40593: Improve syntax errors for invalid characters in source code. #20033
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-05-12.09:42:56.822>
    created_at = <Date 2020-05-11.10:49:33.628>
    labels = ['interpreter-core', 'type-feature', '3.9']
    title = 'Improve error reporting for invalid character in source code'
    updated_at = <Date 2020-05-12.09:42:56.822>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2020-05-12.09:42:56.822>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-05-12.09:42:56.822>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2020-05-11.10:49:33.628>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 40593
    keywords = ['patch']
    message_count = 2.0
    messages = ['368622', '368713']
    nosy_count = 1.0
    nosy_names = ['serhiy.storchaka']
    pr_nums = ['20033']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue40593'
    versions = ['Python 3.9']

    @serhiy-storchaka
    Member Author

    Currently you get a SyntaxError with the message "invalid character in identifier" in two cases:

    1. The source code contains a non-ASCII character that cannot be part of an identifier. This usually happens when you copy code from a web page or PDF file that was "improved" by some enhancer which replaces spaces with non-breaking spaces, the ASCII minus with a dash or Unicode minus, and ASCII quotes with fancy Unicode quotes. These characters do not look like part of an identifier at all. The error message also does not say which character is invalid, and the culprit is hard to find because it looks very similar to the correct character (especially in some monospace fonts).

    See https://mail.python.org/archives/list/python-ideas@python.org/thread/ILMNJ46EAL4ENYK7LLDLGIMYQKZAMMWU/ for discussion.

    2. The other case is very special: the source code contains a declaration of the UTF-8 encoding followed by non-UTF-8 byte sequences. This rarely happens in the real world.
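To illustrate the first case, here is a minimal sketch of a scanner that reports every non-ASCII character that cannot appear in an identifier. The function name `find_suspicious_chars` is hypothetical and not part of CPython or the proposed PR. The sketch also flags such characters inside string literals and comments, where they are perfectly legal, so treat it as a search aid rather than a validator.

```python
import unicodedata


def find_suspicious_chars(source):
    """Yield (line, column, char, code point, name) for each non-ASCII
    character that is not valid inside an identifier.

    Hypothetical helper for illustration only; it does not tokenize the
    source, so characters in strings and comments are reported too.
    """
    for lineno, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) < 128:
                continue  # plain ASCII is never the culprit here
            # "a" + ch is an identifier only if ch may continue an identifier.
            if ("a" + ch).isidentifier():
                continue
            name = unicodedata.name(ch, "<unnamed>")
            yield lineno, col, ch, f"U+{ord(ch):04X}", name


sample = "x = 1\u00a0+ 2\nprint(123\u201445)\nna\u00efve = 3\n"
found = list(find_suspicious_chars(sample))
for lineno, col, ch, codepoint, name in found:
    print(f"line {lineno}, column {col}: {ch!r} {codepoint} {name}")
```

The non-breaking space and the dash are reported, while the letter in `naïve` is skipped because it is a valid identifier character.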

    The proposed PR improves errors for these cases.

    >>> print(123—45)
      File "<stdin>", line 1
        print(123—45)
                 ^
    SyntaxError: invalid character '—' (U+2014)
    • The error message no longer contains misleading "in identifier".

    • The error message contains the invalid character, both literally and as its hexadecimal code point.

    • The caret points at the invalid character. Previously it pointed at the last non-ASCII or non-alphabetical character following the invalid character (5 in the above example).

    • For the special case of a non-decodable UTF-8 sequence, the syntax error message is more informative: "(unicode error) 'utf-8' codec can't decode byte 0xff ...", although this case needs further improvement.
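The new message can also be inspected programmatically. The following is a small sketch, assuming Python 3.9 or later for the improved wording. The helper name `describe_syntax_error` is made up for this example. It compiles a snippet containing U+2014 and returns the fields of the resulting SyntaxError.

```python
def describe_syntax_error(source, filename="<example>"):
    """Compile source and return (msg, lineno, offset) of the SyntaxError,
    or None if the source compiles cleanly.

    Hypothetical helper for illustration only.
    """
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return exc.msg, exc.lineno, exc.offset
    return None


# \u2014 is the dash from the example above, written as an escape so
# that this snippet itself stays pure ASCII.
msg, lineno, offset = describe_syntax_error("print(123\u201445)\n")
print(f"line {lineno}, offset {offset}: {msg}")
```

On older interpreters the same call reports "invalid character in identifier" instead, which makes the difference easy to compare across versions.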

    @serhiy-storchaka serhiy-storchaka added 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels May 11, 2020
    @serhiy-storchaka
    Member Author

    New changeset 74ea6b5 by Serhiy Storchaka in branch 'master':
    bpo-40593: Improve syntax errors for invalid characters in source code. (GH-20033)
    74ea6b5

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022