Edge case in compiler when error displaying with non-utf8 lines #88515

Closed
ammaraskar opened this issue Jun 8, 2021 · 6 comments
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@ammaraskar
Member

BPO 44349
Nosy @ammaraskar, @lysnikolaou, @pablogsal, @miss-islington
PRs
  • bpo-44349: Fix edge case when displaying text from files with encoding in syntax errors #26611
  • [3.10] bpo-44349: Fix edge case when displaying text from files with encoding in syntax errors (GH-26611) #26616
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-06-08.23:55:24.045>
    created_at = <Date 2021-06-08.17:50:20.904>
    labels = ['interpreter-core', '3.9', '3.10', '3.11']
    title = 'Edge case in compiler when error displaying with non-utf8 lines'
    updated_at = <Date 2021-06-09.00:29:32.909>
    user = 'https://github.com/ammaraskar'

    bugs.python.org fields:

    activity = <Date 2021-06-09.00:29:32.909>
    actor = 'pablogsal'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-06-08.23:55:24.045>
    closer = 'pablogsal'
    components = ['Parser']
    creation = <Date 2021-06-08.17:50:20.904>
    creator = 'ammar2'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 44349
    keywords = ['patch']
    message_count = 6.0
    messages = ['395347', '395350', '395351', '395354', '395369', '395370']
    nosy_count = 4.0
    nosy_names = ['ammar2', 'lys.nikolaou', 'pablogsal', 'miss-islington']
    pr_nums = ['26611', '26616']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue44349'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    @ammaraskar
    Member Author

    The AST currently stores column offsets as byte offsets. However, when displaying errors, these byte offsets must be converted into character offsets so that the carets line up with the characters on the printed line. This is done by the function byte_offset_to_character_offset, which assumes that the line is UTF-8 encoded:

    cpython/Parser/pegen.c

    Lines 142 to 161 in fdc7e52

    static inline Py_ssize_t
    byte_offset_to_character_offset(PyObject *line, Py_ssize_t col_offset)
    {
        const char *str = PyUnicode_AsUTF8(line);
        if (!str) {
            return 0;
        }
        Py_ssize_t len = strlen(str);
        if (col_offset > len + 1) {
            col_offset = len + 1;
        }
        assert(col_offset >= 0);
        PyObject *text = PyUnicode_DecodeUTF8(str, col_offset, "replace");
        if (!text) {
            return 0;
        }
        Py_ssize_t size = PyUnicode_GET_LENGTH(text);
        Py_DECREF(text);
        return size;
    }
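For readers who prefer Python, here is a rough sketch of the same conversion. It is an approximation for illustration, not the CPython implementation: it assumes a non-negative offset and clamps to the encoded length rather than to `len + 1` as the C code does.

```python
def byte_offset_to_character_offset(line: str, col_offset: int) -> int:
    """Map a UTF-8 byte offset within `line` to a character offset."""
    data = line.encode("utf-8")
    # Clamp the offset so it never points past the end of the line.
    col_offset = min(col_offset, len(data))
    # Decode only the prefix; a cut inside a multi-byte character
    # turns into U+FFFD instead of raising.
    prefix = data[:col_offset].decode("utf-8", errors="replace")
    return len(prefix)


line = "'¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1))"
# Byte offset of the generator expression in the UTF-8 encoded line.
byte_col = len("'¢¢¢¢¢¢' + f(4, ".encode("utf-8"))
char_col = byte_offset_to_character_offset(line, byte_col)
print(byte_col, char_col)  # 22 16
```

Each "¢" takes 2 bytes in UTF-8, so the byte offset (22) is 6 larger than the character offset (16) at which the carets should start.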

    However, consider a file like this:

    '¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError

    This prints

      File "test-normal.py", line 1
        '¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError
                        ^^^^^^^^^^^^^^^^^^^^^^
    SyntaxError: Generator expression must be parenthesized

    as expected.

    However, if we add a custom source encoding declaration:

    # -*- coding: cp437 -*-
    '¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError

    it ends up printing out

    File "C:\Users\ammar\junk\test-utf16.py", line 2
    '¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError
    ^^^^^^^^^^^^^^^^^^^^^^
    SyntaxError: Generator expression must be parenthesized

    where the carets/offsets are misaligned with the actual characters. This is because the string "┬ó" has a display width of 2 characters and encodes to 2 bytes in cp437, but those same 2 bytes, when interpreted as UTF-8, are the single character "¢" with a display width of 1.
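The mismatch can be reproduced directly with Python's codecs. This is only an illustrative sketch: the variable names are made up here, and the re-encoding step reflects my understanding that the tokenizer keeps a UTF-8 copy of the decoded source.

```python
# The bytes of the string literal as they are stored in the file.
raw = "'¢¢¢¢¢¢'".encode("utf-8")

# Read as UTF-8 (what the error display does): 8 characters.
as_utf8 = raw.decode("utf-8")

# Read as cp437 (what the coding declaration asks for): 14 characters,
# because every 2-byte "¢" becomes the two characters "┬ó".
as_cp437 = raw.decode("cp437")

# Byte length of the cp437 text once re-encoded as UTF-8.
reencoded_len = len(as_cp437.encode("utf-8"))

print(len(raw), len(as_utf8), len(as_cp437), reencoded_len)  # 14 8 14 32
```

Byte offsets computed against the 32-byte re-encoded text do not correspond to positions in the 14-byte raw line, which is why the carets drift.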

    Note that this edge case is relatively hard to trigger. Ordinarily, the call to PyErr_ProgramTextObject fails because it tries to decode the line as UTF-8:

    cpython/Python/errors.c

    Lines 1693 to 1696 in ae3c66a

    res = PyUnicode_FromString(linebuf);
    if (res == NULL)
        _PyErr_Clear(tstate);
    return res;
    after which the error handling logic uses the tokenizer's internal buffer which has a proper utf-8 string.
    So this bug requires the input to be valid as both utf-8 and the source encoding.
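A small sketch of that condition, using a hypothetical helper `decodes_as` (not part of CPython): a "¢" that was genuinely saved as cp437 is not valid UTF-8, so it would take the fallback path, while the UTF-8 bytes of "¢" happen to be valid in both encodings.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` can be decoded with `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


ambiguous = "¢".encode("utf-8")  # b'\xc2\xa2'
genuine = "¢".encode("cp437")    # b'\x9b'

# Valid under both encodings: this is the input that triggers the bug.
print(decodes_as(ambiguous, "utf-8"), decodes_as(ambiguous, "cp437"))
# Valid cp437 only: the UTF-8 decode fails, so the fallback path is used.
print(decodes_as(genuine, "utf-8"), decodes_as(genuine, "cp437"))
```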

    (Discovered while implementing PEP-657 colnotab#10)

    @ammaraskar ammaraskar added 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Jun 8, 2021
    @pablogsal
    Member

    Lysandros, could you take a look?

    @pablogsal
    Member

    This also affects older versions:

    python3.8 lel.py
    File "lel.py", line 3

                  ^
    

    SyntaxError: Generator expression must be parenthesized

    @pablogsal pablogsal changed the title from "Edge case in pegen's error displaying with non-utf8 lines" to "Edge case in when error displaying with non-utf8 lines" Jun 8, 2021
    @ammaraskar ammaraskar changed the title from "Edge case in when error displaying with non-utf8 lines" to "Edge case in compiler when error displaying with non-utf8 lines" Jun 8, 2021
    @pablogsal
    Member

    I think the simplest solution is PR 26611.

    Ammar, can you check if that works for you?

    @pablogsal
    Member

    New changeset 9fd21f6 by Pablo Galindo in branch 'main':
    bpo-44349: Fix edge case when displaying text from files with encoding in syntax errors (GH-26611)
    9fd21f6

    @pablogsal
    Member

    New changeset c049609 by Miss Islington (bot) in branch '3.10':
    bpo-44349: Fix edge case when displaying text from files with encoding in syntax errors (GH-26611) (GH-26616)
    c049609

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022